Foreword



The ever-expanding growth of Information Technology continues to place fresh 
demands on the management of data. Database researchers must respond to new 
challenges, particularly to the opportunities offered by the Internet for access to 
distributed, semi-structured and multimedia data sources. 

This volume contains the proceedings of the 18th British National Conference on
Databases (BNCOD 2001), held at the Rutherford Appleton Laboratory in July 2001. 
In recent years, interest in this conference series has extended well beyond the UK. In 
selecting just eleven of the submitted papers for presentation, the programme 
committee has included contributors from The Netherlands, Germany, Sweden, 
Canada and USA. In addition, two specially invited speakers address subjects of 
topical interest. 

Our first invited speaker is Professor Dr. Rudi Studer from the University of 
Karlsruhe. At AIFB, the Institute for Applied Informatics and Formal Description 
Methods, he and his colleagues are in the forefront of work on the Semantic Web. 
This aims to make information accessible to human and software agents on a semantic 
basis. The paper discusses the role that semantic structures, based on ontologies, play 
in establishing communication between different agents. The AIFB web site has been 
developed as a semantic portal to serve as a case study. 

The massive increase in data volumes from big science such as remote sensing and 
high energy physics means that we now contemplate the storage and processing of 
petabytes. Grid technology, specifically the “Data Grid”, is seen as attractive. It is
thus timely that our second invited speaker addresses strategy in this field. He is
Professor Tony Hey, recently appointed as Director of the UK e-Science Core
Programme and well placed to expound the vision.

The contributed papers are presented in four groups. The first of these addresses 
performance and optimisation. This issue has always been at the core of database 
technology. The first paper, by Regan and Delis, reports on a practical study of space 
management in logs. They evaluate a technique for reclaiming log space from short 
transactions while retaining recoverability for long running ones. The increasing 
popularity of XML presents new challenges. Zhu and Lii propose an algorithm for an 
effective storage placement strategy for XML documents that facilitates their efficient 
parallel processing. The trade-off between data quality and performance is an
interesting topic tackled by Caine and Embury. They study algorithms for deferring
integrity checking from periods when the system is too busy to off-peak, “lights out” hours.

The second group of papers concentrates on objects in databases and software 
engineering. The great variety of CASE tools prompts the adoption of standardised
meta-models and transfer formats. In proposing an extension to OCL, Gustavsson 
and Lings further the interchange of models by defining a common, model 
independent notation for design transformations. Next, Zhang and Ritter investigate 
the state of database support for software development using object-oriented 
programming languages. They highlight the shortcomings in this respect of the 
current object-relational database paradigm and suggest how it might beneficially be 
enhanced. The third paper returns to the engineering design environment and tackles 
concurrent version control. Al-Khudair, Gray and Miles present a generalised object- 
oriented model that captures the evolution of design configurations and their 
components by supporting versioning at all levels. 

In the third group of papers, we again consider optimisation. More specifically, 
contributors consider efficient querying in the newer domains of multimedia and 
distributed data sources. The requirements and techniques of the worlds of 
information retrieval and transactional databases are very different. The Dutch team 
of Blok, de Vries, Blanken and Apers present a case study on the “top-N” queries
familiar in content retrieval in the context of a database approach to the management
of multimedia data. The key issues addressed, such as speed and quality of answers
and the opportunities for scalability, are supported by experimental results. A similar
problem is of concern to Sattler, Dunemann, Geist, Saake and Conrad. They seek 
control over the potentially excessive data returned from a query over heterogeneous 
data sources. By extensions to multi-database languages, they explore ways of asking 
for just the “first n” results, or of asking for a sample of the complete result. Still with
the theme of information systems relying on database technology, Waas and Kersten 
are concerned with a web multimedia portal based on the Monet database system. 
Here the optimisation challenge is query throughput. The authors report on the 
performance of a simple and robust scheme for the scheduling of queries in a large, 
parallel, shared-nothing database cluster. 

The two papers in our final group are both about querying objects. However, they 
are very different. Trigoni and Bierman present an inference algorithm for OQL that 
identifies the most general type of a query in the absence of schema type information. 
This is relevant to where heterogeneity is encountered - for example, in any open, 
distributed, or even semi-structured, database environment. Distributed databases and 
virtual reality are combined in the ambitious work reported by Ammoura, Zaiane and 
Ji. They explore data mining in a virtual data warehouse. Rendering multi- 
dimensional data aggregates as objects, the user flies through the data to explore and 
query different views. 
Abstract. The core idea of the Semantic Web is to make information
accessible to human and software agents on a semantic basis. Hence, 
web sites may feed directly from the Semantic Web exploiting the un- 
derlying structures for human and machine access. We have developed a 
generic approach for developing semantic portals, viz. SEAL (SEmantic 
portAL), that exploits semantics for providing and accessing information 
at a portal as well as constructing and maintaining the portal. 

In this paper, we discuss the role that semantic structures play in
establishing communication between different agents in general. We elaborate
on a number of intelligent means that make semantic web sites accessible 
from the outside, viz. semantics-based browsing, semantic querying and 
querying with semantic similarity, and machine access to semantic infor- 
mation at a semantic portal. As a case study we refer to the AIFB web 
site — a place that is increasingly driven by Semantic Web technologies. 



1 Introduction 

The widely-agreed core idea of the Semantic Web is the delivery of data on a 
semantic basis. Intuitively the delivery of semantically apprehended data should 
help with establishing a higher quality of communication between the informa- 
tion provider and the consumer. How this intuition may be put into practice is 
the topic of this paper. 

We discuss means to further communication on a semantic basis. For this one 
needs a theory of communication that links results from semiotics, linguistics, 
and philosophy into actual information technology. We here consider ontologies 
as a sound semantic basis that is used to define the meaning of terms and hence 
to support intelligent access, e.g. by semantic querying or dynamic hypertext
views.

Thus, ontologies constitute the foundation of our SEAL (SEmantic portAL)
approach. The origins of SEAL lie in Ontobroker, which was conceived for
semantic search of knowledge on the Web and also used for sharing knowledge
on the Web. It then developed into an overarching framework for search
and presentation offering access at a portal site. This concept was then
transferred to further applications and is currently being extended into a
commercial solution.^1

B. Read (Ed.): BNCOD 2001, LNCS 2097, pp. 2001.

We here describe the SEAL core modules and the overall architecture
(Section 3). Thereafter, we go into several technical details that are important for
human and machine access to a semantic portal.

In particular, we describe a general approach for semantic ranking
(Section 4). The motivation for semantic ranking is that even with accurate semantic
access, one will often find too much information. Underlying semantic structures, 
e.g. topic hierarchies, give an indication of what should be ranked higher on a 
list of results. 

Finally, we present mechanisms to deliver and collect machine-understandable
data (Section 5). They extend previous means for better digestion
of web site data by software agents. Before we conclude, we give a short
survey of related work.



2 Ontology and Knowledge Base 

For our AIFB intranet, we explicitly model relevant aspects of the domain in
order to allow for a more concise communication between agents, viz. within the
group of software agents, between software and human agents, and, last but not
least, between different human agents. In particular, we describe a way of
modeling an ontology that we consider appropriate for supporting communication
between human and software agents.



2.1 Ontologies for Communication 

Research in ontology has its roots in philosophy dealing with the nature and 
organisation of being. In computer science, the term ontology refers to an engi- 
neering artifact, constituted by a specific vocabulary used to describe a particu- 
lar model of the world, plus a set of explicit assumptions regarding the intended 
meaning of the words in the vocabulary. Both vocabulary and assumptions serve
human and software agents to reach common conclusions when communicating. 

Reference and meaning. The general context of communication (with or without
ontology) is described by the meaning triangle. The meaning triangle defines
the interaction between symbols or words, concepts and things of the world (cf.
Figure 1).

The meaning triangle illustrates the fact that although words cannot
completely capture the essence of a reference (= concept) or of a referent (= thing),
there is a correspondence between them. The relationship between a word and
a thing is indirect. The correct linkage can only be accomplished when an
interpreter processes the word, invoking a corresponding concept and establishing
the proper linkage between his concept and the appropriate thing in the world.

Fig. 1. The Meaning Triangle

^1 cf. http://www.time2research.de

Logics. An ontology is a general logical theory constituted by a vocabulary and 
a set of statements about a domain of interest in some logic language. The 
logical theory specifies relations between signs and it apprehends relations with 
a semantics that restricts the set of possible interpretations of the signs. Thus, 
the ontology reduces the number of mappings from signs to things in the world 
that an interpreter who is committed to the ontology can perform — in the ideal 
case each sign from the vocabulary eventually stands for exactly one thing in 
the world. 

Figure 2 depicts the overall setting for communication between human and
software agents. We mainly distinguish three layers: First of all, we deal with 
things that exist in the real world, including in this example human and soft- 
ware agents, cars, and animals. Secondly, we deal with symbols and syntactic 
structures that are exchanged. Thirdly, we analyze models with their specific 
semantic structures. 

Let us first consider the left side of Figure 2 without assuming a commitment
to a given ontology. Two human agents HA1 and HA2 exchange a specific
sign, e.g. a word like “jaguar”. Given their own internal model each of them 
will associate the sign to his own concept referring to possibly two completely 
different existing things in the world, e.g. the animal vs. the car. The same holds 
for software agents: They may exchange statements based on a common syntax, 
however, they may have different formal models with differing interpretations. 

We consider the scenario that both human agents commit to a specific ontol- 
ogy that deals with a specific domain, e.g. animals. The chance that they both 
refer to the same thing in the world increases considerably. The same holds for 
the software agents SA1 and SA2: They have actual knowledge and they use
the ontology to have a common semantic basis. When agent SA1 uses the term
“jaguar”, the other agent SA2 may use the ontology just mentioned as background
knowledge and rule out incorrect references, e.g. ones that let “jaguar”
stand for the car. Human and software agents use their concepts and their
inference processes, respectively, in order to narrow down the choice of referents
(e.g., because animals do not have wheels, but cars do).
[Figure 2: Human Agents 1 and 2 and Machine Agents 1 and 2 exchange signs, e.g. protocols, concerning a specific domain, e.g. animals. Layers shown: things in the real world; symbols / syntactic structures; concepts / semantic structures.]

Fig. 2. Communication between human and/or software agents 

A new model for ontologies. Subsequently, we define our notion of ontology. 
However, in contrast to most other research about ontology languages it is not 
our purpose to invent a new logic language or to redescribe an old one. Rather 
what we specify is a way of modeling an ontology that inherently considers 
the special role of signs (mostly strings in current ontology-based systems) and 
references. 

Our motivation is based on the conflict that ontologies are for human and 
software agents, but logical theories are mostly for mathematicians and inference 
engines. Formal semantics for ontologies is a sine qua non. In fact, we build our 
applications on a well-understood logical framework, viz. F-Logic. However,
in addition to the benefits of logical rigor, users and developers of an
ontology-based system profit from ontology structures that help to elucidate possible
misunderstandings.

For instance, one might specify that the sign “jaguar” refers to the union of 
the set of all animals that are jaguars and the set of all cars that are jaguars. 
Alternatively, one may describe that “jaguar” is a sign that may either refer to a 
concept “animal-jaguar” or to a concept “car-jaguar”. We prefer the second way. 
In conjunction with appropriate GUI modules (cf. Section 3 ff.) one may avoid
presentations of ‘funny symbols’ to the user like “animal-jaguar”, while avoiding 
‘funny inference’ such as may arise from artificial concepts like the union of the 
sets denoted by ‘animal-jaguar’ and ‘car-jaguar’. 



2.2 Ontology vs. Knowledge Base 

Concerning the general setting just sketched, the term ontology is defined — 
more or less — as some piece of formal knowledge. However, there are several 
properties that warrant the distinction of knowledge contained in the ontology 
vs. knowledge contained in the so-called knowledge base, which are summarized 
in Table 1.
Table 1. Distinguishing ontology and knowledge base

                          Ontology         Knowledge base
Set of logic statements   yes              yes
Theory                    general theory   theory of particular circumstances
Statements are mostly     intensional      extensional
Construction              set up once      continuous change
Description logics        T-Box            A-Box




The ontology constitutes a general logical theory, while the knowledge base 
describes particular circumstances. In the ontology one tries to capture the gen- 
eral conceptual structures of a domain of interest, while in the knowledge base 
one aims at the specification of the given state of affairs. Thus, the ontology 
is (mostly) constituted by intensional logical definitions, while the knowledge 
base comprises (mostly) the extensional parts. The theory in the ontology is one 
which is mostly developed during the set up (and maintenance) of an ontology- 
based system, while the facts in the knowledge base may be constantly changing. 
In description logics, the ontology part is mostly described in the T-Box and the 
knowledge base in the A-Box. However, our current experience is that it is not
always possible to distinguish the ontology from the knowledge base by the
logical statements that are made. In the conclusion we will briefly mention some of
the problems, referring to examples from the following sections.

The distinctions (“general” vs. “specific”, “intensional” vs. “extensional”, 
“set up once” vs. “continuous change”) indicate that for purposes of develop- 
ment, maintenance, and good design of the software system it is reasonable to 
distinguish between ontology and knowledge base. Also, they describe a rough 
shape of where to put which parts of a logical theory constraining the intended 
semantic models that facilitate the referencing task for human and software 
agents. However, the reader should note that none of these distinctions draws a
clear-cut borderline between ontology and knowledge base in general. Rather,
it is typical that in a few percent of cases it depends on the domain, the view
of the modeler, and the experience of the modeler, whether she decides to put
particular entities and relations into the ontology or into the knowledge base.

The following two definitions of ontology and knowledge base specify constraints
on the way an ontology (or a knowledge base) should be modeled in a particular 
logical language like F-Logic or OIL: 

Definition 1 (Ontology). An ontology is a sign system
$\mathcal{O} := (\mathcal{L}, \mathcal{F}, \mathcal{G}, \mathcal{C}, \mathcal{H}, \mathcal{R}, \mathcal{A})$, which consists of

- A lexicon: The lexicon contains a set of signs (lexical entries) for concepts,
$\mathcal{L}^c$, and a set of signs for relations, $\mathcal{L}^r$. Their union is the lexicon
$\mathcal{L} := \mathcal{L}^c \cup \mathcal{L}^r$.

- Two reference functions $\mathcal{F}$, $\mathcal{G}$, with $\mathcal{F} : 2^{\mathcal{L}^c} \mapsto 2^{\mathcal{C}}$ and
$\mathcal{G} : 2^{\mathcal{L}^r} \mapsto 2^{\mathcal{R}}$. $\mathcal{F}$ and $\mathcal{G}$ link sets of lexical entries
$\{L_i\} \subset \mathcal{L}$ to the set of concepts and relations
they refer to, respectively, in the given ontology. In general, one lexical entry
may refer to several concepts or relations and one concept or relation may
be referred to by several lexical entries. Their inverses are $\mathcal{F}^{-1}$ and $\mathcal{G}^{-1}$.

In order to map easily back and forth and because there is an n to m mapping
between lexicon and concepts/relations, $\mathcal{F}$ and $\mathcal{G}$ are defined on sets rather
than on single objects.

- A set $\mathcal{C}$ of concepts: About each $C \in \mathcal{C}$ exists at least one statement in the
ontology, viz. its embedding in the taxonomy.

- A taxonomy $\mathcal{H}$: Concepts are taxonomically related by the irreflexive,
acyclic, transitive relation $\mathcal{H}$, ($\mathcal{H} \subset \mathcal{C} \times \mathcal{C}$). $\mathcal{H}(C_1, C_2)$ means that $C_1$ is a
subconcept of $C_2$.

- A set of binary relations $\mathcal{R}$: $\mathcal{R}$ denotes a set of binary relations.^2 They
specify pairs of domain and ranges $(D, R)$ with $D, R \in \mathcal{C}$.

The functions $d$ and $r$ applied to a binary relation $Q$ yield the corresponding
domain and range concepts $D$ and $R$, respectively.

- A set of ontology axioms, $\mathcal{A}$.
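
To make the sign-system reading of Definition 1 more concrete, the following Python sketch models the concept part of the lexicon, a set-valued reference function with its inverse, and the taxonomy. It is an illustration under our own assumptions and not part of the SEAL implementation: the class name `Ontology`, its attribute names, and the extra sign "panthera onca" are hypothetical; only the “jaguar” example is taken from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Illustrative sign system: lexicon, reference function and taxonomy."""
    concepts: set = field(default_factory=set)
    # sign -> set of concepts it may refer to (n-to-m mapping)
    concept_refs: dict = field(default_factory=dict)
    # direct taxonomy edges (subconcept, superconcept)
    taxonomy: set = field(default_factory=set)

    def F(self, signs):
        """Reference function on sets of signs: union of the referenced concepts."""
        result = set()
        for sign in signs:
            result |= self.concept_refs.get(sign, set())
        return result

    def F_inv(self, concepts):
        """Inverse reference function: all signs referring to any given concept."""
        wanted = set(concepts)
        return {s for s, cs in self.concept_refs.items() if cs & wanted}

    def is_subconcept(self, c1, c2):
        """True if c1 is a direct or transitive subconcept of c2 (irreflexive)."""
        frontier = {sup for sub, sup in self.taxonomy if sub == c1}
        seen = set()
        while frontier:
            c = frontier.pop()
            if c == c2:
                return True
            if c not in seen:
                seen.add(c)
                frontier |= {sup for sub, sup in self.taxonomy if sub == c}
        return False


onto = Ontology(
    concepts={"Animal", "Car", "animal-jaguar", "car-jaguar"},
    concept_refs={
        "jaguar": {"animal-jaguar", "car-jaguar"},
        "panthera onca": {"animal-jaguar"},  # hypothetical second sign
    },
    taxonomy={("animal-jaguar", "Animal"), ("car-jaguar", "Car")},
)
```

In this sketch the sign “jaguar” maps to two distinct concepts rather than to one artificial union concept, which mirrors the modeling choice preferred in Section 2.1.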

The reader may note that the structure we propose is very similar to the 
WordNet model described by Miller. WordNet has been conceived as a mixed
linguistic / psychological model about how people associate words with their 
meaning. Like WordNet, we allow that one word may have several meanings and 
one concept (synset) may be represented by several words. However, we allow 
for a seamless integration into logical languages like OIL or F-Logic by providing 
very simple means for definition of relations and for knowledge bases. 

We define a knowledge base as a collection of object descriptions that refer 
to a given ontology. 

Definition 2 (Knowledge Base). We define a knowledge base as a 7-tuple
$\mathcal{KB} := (\mathcal{L}, \mathcal{J}, \mathcal{I}, \mathcal{W}, \mathcal{S}, \mathcal{A}, \mathcal{O})$, that consists of

- A lexicon containing a set of signs for instances, $\mathcal{L}$.

- A reference function $\mathcal{J}$ with $\mathcal{J} : 2^{\mathcal{L}} \mapsto 2^{\mathcal{I}}$. $\mathcal{J}$ links sets of lexical entries
$\{L_i\} \subset \mathcal{L}$ to the set of instances they correspond to.

Thereby, names may be multiply used, e.g. “Athens” may be used for
“Athens, Georgia” or for “Athens, Greece”.

- A set of instances $\mathcal{I}$. About each $I_k \in \mathcal{I}, k = 1, \ldots, l$ exists at least one
statement in the knowledge base, viz. a membership to a concept $C$ from the
ontology $\mathcal{O}$.

- A membership function $\mathcal{W}$ with $\mathcal{W} : 2^{\mathcal{I}} \mapsto 2^{\mathcal{C}}$. $\mathcal{W}$ assigns sets of instances
to the sets of concepts they are members of.

- Instantiated relations, $\mathcal{S}$, are described, viz.
$\mathcal{S} \subseteq \{(x, y, z) \mid x \in \mathcal{I}, y \in \mathcal{R}, z \in \mathcal{I}\}$.

- A set of knowledge base axioms, $\mathcal{A}$.

- A reference to an ontology $\mathcal{O}$.
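
The knowledge base of Definition 2 can be sketched in the same spirit. The Python fragment below is only an assumption-laden illustration: the class `KnowledgeBase`, the instance identifiers `athens_ga` and `athens_gr`, and the concept `City` are hypothetical names introduced here, while the “Athens” ambiguity is the example given above. Axioms are left out.

```python
class KnowledgeBase:
    """Illustrative knowledge base: instance lexicon, J, W and relation triples."""

    def __init__(self, ontology_concepts):
        # Stand-in for the reference to an ontology O (only its concepts are used).
        self.ontology_concepts = set(ontology_concepts)
        self.instance_refs = {}  # sign -> set of instances (reference function J)
        self.membership = {}     # instance -> set of concepts (membership function W)
        self.relations = set()   # triples (instance, relation, instance), the set S

    def add_instance(self, instance, concept, signs):
        """Register an instance with at least one concept membership and its signs."""
        if concept not in self.ontology_concepts:
            raise ValueError(f"unknown concept: {concept}")
        self.membership.setdefault(instance, set()).add(concept)
        for sign in signs:
            self.instance_refs.setdefault(sign, set()).add(instance)

    def J(self, signs):
        """Map a set of signs to the set of instances they may refer to."""
        result = set()
        for sign in signs:
            result |= self.instance_refs.get(sign, set())
        return result

    def W(self, instances):
        """Map a set of instances to the set of concepts they are members of."""
        result = set()
        for instance in instances:
            result |= self.membership.get(instance, set())
        return result


kb = KnowledgeBase({"City"})
kb.add_instance("athens_ga", "City", ["Athens", "Athens, Georgia"])
kb.add_instance("athens_gr", "City", ["Athens", "Athens, Greece"])
```

Requiring a concept in `add_instance` reflects the condition that about each instance at least one statement exists, namely its membership to a concept of the ontology.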

Overall the decision to model some relevant part of the domain in the ontology
vs. in the knowledge base is often based on gradual distinctions and driven
by the needs of the application. Concerning the technical issue it is sometimes
even useful to let the lexicon of knowledge base and ontology overlap, e.g. to use
a concept name to refer to a particular instance in a particular context. In fact
researchers in natural language have tackled the question how the reference
function $\mathcal{J}$ can be dynamically extended given an ontology, a context, a knowledge
base and a particular sentence.

^2 Here at the conceptual level, we do not distinguish between relations and attributes.

3 SEAL Infrastructure and Core Modules 

The aim of our intranet application is the presentation of information to human 
and software agents taking advantage of semantic structures. In this section, we 
first elaborate on the general architecture for SEAL (SEmantic PortAL), before 
we explain functionalities of its core modules. 

3.1 Architecture 

The overall architecture and environment of SEAL is depicted in Figure 3.
The backbone of the system consists of the knowledge warehouse, i.e. the data 
repository, and the Ontobroker system, i.e. the principal inferencing mechanism. 
The latter functions as a kind of middleware run-time system, possibly mediat- 
ing between different information sources when the environment becomes more 
complex than it is now. 




Fig. 3. AIFB Intranet - System architecture 
At the front end one may distinguish between three types of agents: software
agents, community users and general users. All three of them communicate with 
the system through the web server. The three different types of agents correspond 
to three primary modes of interaction with the system. 

First, remote applications (e.g. software agents) may process information
stored at the portal over the internet. For this purpose, the RDF generator 
presents RDF facts through the web server. Software agents with RDF crawlers 
may collect the facts and, thus, have direct access to semantic knowledge stored 
at the web site. 

Second, community users and general users can access information contained 
at the web site. Two forms of access are supported: navigating through the
portal by exploiting the hyperlink structure of documents and searching for
information by posting queries. The hyperlink structure is partially given by the
portal builder, but it may be extended with the help of the navigation module. 
The navigation module exploits inferencing capabilities of the inference engine 
in order to construct conceptual hyperlink structures. Searching and querying is 
performed via the query module. In addition, the user can personalise the search 
interface using the semantic personalization preprocessing module and/or rank 
retrieved results according to semantic similarity (done by the postprocessing 
module for semantic ranking). Queries also take advantage of the Ontobroker 
inferencing. 

Third, only community users can provide data. Typical information they con- 
tribute includes personal data, information about research areas, publications, 
activities and other research information. For each type of information they con- 
tribute there is (at least) one concept in the ontology. Retrieving parts of the 
ontology, the template module may semi-automatically produce suitable HTML 
forms for data input. The community users fill in these forms and the template
module stores the data in the knowledge warehouse.

3.2 Core Modules 

The core modules have been extensively described elsewhere. In order to give the
reader a compact overview we here briefly survey their function. In the
remainder of the paper we delve deeper into those aspects that have been added or
considerably extended recently, viz. semantic ranking (Section 4) and semantic
access by software agents (Section 5).

Ontobroker. The Ontobroker system is a deductive, object-oriented database
system operating either in main memory or on a relational database (via JDBC).
It provides compilers for different languages to describe ontologies, rules and
facts. Besides other uses, in this architecture it also serves as an inference engine
(server). It reads input files containing the knowledge base and the ontology, 
evaluates incoming queries, and returns the results derived from the combination 
of ontology, knowledge base and query. 

The possibility to derive additional factual knowledge from given facts and
background knowledge considerably eases the life of the knowledge providers
and the knowledge seekers. For instance, one may specify that if a person belongs
to a research group of institute AIFB, he also belongs to AIFB. Thus, it is
unnecessary to specify membership both of his research group and of AIFB.
Conversely, the information seeker does not have to take care of inconsistent
assignments, e.g. ones that specify membership of an AIFB research group, but
that have erroneously left out the membership of AIFB.
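
The kind of derivation described here can be imitated by a few lines of naive forward chaining. The sketch below is written in Python rather than F-Logic and does not reflect how Ontobroker evaluates rules; the function `derive_memberships` and the group name `KnowledgeManagementGroup` are hypothetical and serve only as an illustration.

```python
def derive_memberships(member_of, part_of):
    """Naive forward chaining.

    Rule: if a person is a member of a group and that group is part of an
    institute, then the person is also a member of the institute.
    """
    derived = set(member_of)
    changed = True
    while changed:
        changed = False
        for person, group in list(derived):
            for sub, sup in part_of:
                if sub == group and (person, sup) not in derived:
                    derived.add((person, sup))
                    changed = True
    return derived


# Hypothetical facts: only the group membership is stated explicitly.
facts = {("Gerd", "KnowledgeManagementGroup")}
structure = {("KnowledgeManagementGroup", "AIFB")}
memberships = derive_memberships(facts, structure)
```

After the call, `memberships` contains the stated group membership as well as the derived membership of AIFB, so the provider only has to enter the former.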

Knowledge warehouse. The knowledge warehouse serves as repository for
data represented in the form of F-Logic statements. It hosts the ontology, as 
well as the data proper. From the point of view of inferencing (Ontobroker) 
the difference is negligible, but from the point of view of maintaining the sys- 
tem the difference between ontology definition and its instantiation is useful. 
The knowledge warehouse is organised around a relational database, where facts 
and concepts are stored in a reified format. It states relations and concepts as 
first-order objects and it is therefore very flexible with regard to changes and 
amendments of the ontology. 

Navigation module. Besides the hierarchical, tree-based hyperlink structure which
corresponds to the hierarchical decomposition of the domain, the navigation module
enables complex graph-based semantic hyperlinking, based on ontological relations
between concepts (nodes) in the domain. The conceptual approach to hyperlinking
is based on the assumption that semantically relevant hyperlinks from a web
page correspond to conceptual relations, such as memberOf or hasPart, or to
attributes, like hasName. Thus, instances in the knowledge base may be presented
by automatically generating links to all related instances. For example, on
personal web pages (cf. Figure 5) there are hyperlinks to web pages that describe
the corresponding research groups, research areas and project web pages.
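
As a rough illustration of generating links from instantiated relations, consider the following Python sketch. It is an assumption of ours rather than the navigation module itself: the function `related_links`, the URL scheme `/instance/...`, and the group name `KMGroup` are hypothetical.

```python
def related_links(instance, relations, url_for):
    """Return (label, url) pairs for every instance related to `instance`.

    `relations` is a set of triples (subject, relation, object); `url_for`
    maps an instance name to the address of its web page.
    """
    links = []
    for subj, rel, obj in sorted(relations):
        if subj == instance:
            links.append((f"{rel}: {obj}", url_for(obj)))
    return links


relations = {
    ("Gerd", "worksIn", "OntoWise"),
    ("Gerd", "memberOf", "KMGroup"),  # hypothetical group name
    ("Andreas", "worksIn", "TelekomProject"),
}
links = related_links("Gerd", relations, lambda name: f"/instance/{name}")
```

A person page for Gerd would then carry one link per related instance, labelled with the relation it stems from.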

Query module. The query module puts an easy-to-use interface on the query ca- 
pabilities of the F-Logic query interface of Ontobroker. The portal builder mod- 
els web pages that serve particular query needs, such as querying for projects or 
querying for people. For this purpose, selection lists that restrict query possibili- 
ties are offered to the user. The selection lists are compiled using knowledge from 
the ontology and/or the knowledge base. For instance, the query interface for
persons allows searching for people according to the research groups they are
members of. The list of research groups is dynamically filled by an F-Logic query
and presented to the user for easy choice in a drop-down list (cf. snapshot in
Figure 4).

Even simpler, one may associate a hyperlink with an F-Logic query that is
dynamically evaluated when the link is hit. More complex, one may construct 
an isA, a hasPart, or a hasSubtopic tree, from which query events are triggered 
when particular nodes in the tree are navigated. 

Personalization module. The personalization component provides check-box
personalization and preference-based personalization (including profiling
from semantics-based log files). For instance, one may detect that user group 
A is particularly interested in all pages that deal with nature-analog algorithms, 
e.g. ones about genetic algorithms or ant algorithms. 



A. Maedche et al. 



Fig. 4. Query form based on definition of concept Person 



Template module. In order to facilitate the contribution of information by
community users, the template module generates an HTML form for each concept
that a user may instantiate. For instance, in the AIFB intranet there is an
input template (cf. Figure 5, upper left) generated from the concept definition of
person (cf. Figure 5, lower left). The data is later used by the navigation
module to produce the corresponding person web page (cf. Figure 5, right-hand
side).

In order to reduce the data required for input, the portal builder specifies 
which attributes and relations are derived from other templates. For example, 
in our case the portal builder has specified that project membership is defined 
in the project template. The co-ordinator of a project enters information about 
which persons are participants of the project and this information is used when 
generating the person web page taking advantage of a corresponding F-Logic 
rule for inverse relationships. Hence, it is unnecessary to input this information 
in the person template. 
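
A minimal sketch of this idea, assuming a concept is given simply as a list of attribute names, might look as follows in Python. The functions `form_for_concept` and `inverse`, the attribute names, and the form action URL are hypothetical; the actual template module works on F-Logic concept definitions and is not reproduced here.

```python
from html import escape


def form_for_concept(concept, attributes, derived=()):
    """Return an HTML form with one text input per non-derived attribute."""
    rows = []
    for attr in attributes:
        if attr in derived:
            continue  # supplied through another template, so not asked for here
        name = escape(attr, quote=True)
        rows.append(f'<label>{name} <input type="text" name="{name}"></label>')
    body = "\n".join(rows)
    action = escape(concept, quote=True)
    return f'<form method="post" action="/{action}">\n{body}\n</form>'


def inverse(pairs):
    """Derive the inverse relation, e.g. worksIn from a project's participants."""
    return {(b, a) for a, b in pairs}


# Project membership is entered in the project template only.
person_form = form_for_concept(
    "Person", ["name", "email", "worksIn"], derived={"worksIn"}
)
participants = {("OntoWise", "Gerd")}
works_in = inverse(participants)
```

The person form omits the derived relation, and the person page can still show project membership because it is obtained as the inverse of the participants entered by the project co-ordinator.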



Ontology lexicon. The different modules described here make extensive use of
the lexicon component of the ontology. The most prevalent use is the distinction
between English and German (realized for presentation, though not yet for the
template module). In the future we envision that one may produce more
adaptive web sites making use of the explicit lexicon. For instance, we will be
able to produce short descriptions when the context is sufficiently narrow, e.g.
when working with ambiguous acronyms like ASP^3 or SEAL^4.

Fig. 5. Templates generated from concept definitions



4 Semantic Ranking 

This section describes the architecture component “Semantic Ranking” which 
has been developed in the context of our application. First, we will introduce 
and motivate the requirement for a ranking approach with a small example 
we are facing. Second, we will show how the problem of semantic ranking
may be reduced to the comparison of two knowledge bases. Query results are 
reinterpreted as “query knowledge bases” and their similarity to the original 
knowledge base without axioms yields the basis for semantic ranking. Thereby, 
we reduce our notion of similarity between two knowledge bases to the similarity 
of concept pairs. Let us assume the following ontology:

^3 Active server pages vs. active service providers.

^4 “SouthEast Asian Linguistics Conference” vs. “Conference on Simulated Evolution
and Learning” vs. “Society for Evolutionary Analysis in Law” vs. “Society for
Effective Affective Learning” vs. some dozens of others, several of which are indeed
relevant in our institute.
    1 : Person :: Object[worksIn =>> Project].
    2 : Project :: Object[hasTopic =>> Topic].
    3 : Topic :: Object[subtopicOf =>> Topic].                          (1)
    4 : FORALL X, Y, Z  Z[hasTopic ->> Y] <- X[subtopicOf ->> Y]
                                              and Z[hasTopic ->> X].

To give an intuition of the semantics of the F-Logic statements, in line 1 one
finds a concept definition for a Person being an Object with a relation worksIn. 
The range of the relation for this Person is restricted to Project. 

Let us further assume the following knowledge base: 

    5: KnowledgeManagement : Topic.
    6: KnowledgeDiscovery : Topic[subtopicOf ->> KnowledgeManagement].
    7: Gerd : Person[worksIn ->> OntoWise].
    8: OntoWise : Project[hasTopic ->> KnowledgeManagement].                  (2)
    9: Andreas : Person[worksIn ->> TelekomProject].
    10: TelekomProject : Project[hasTopic ->> KnowledgeDiscovery].

Definitions of instances in the knowledge base are syntactically very similar
to the concept definition in F-Logic. In line 6 the instance KnowledgeDiscovery of
the concept Topic is defined. Furthermore, the relation subtopicOf is instantiated
between KnowledgeDiscovery and KnowledgeManagement. Similarly, in line 7 it is
stated that Gerd is a Person working in OntoWise. Ontology axioms like the one
given in line 4 of (1) use this syntax to describe regularities. Line 4 states that if
some Z has topic X and X is a subtopic of Y then Z also has topic Y.

Now, an F-Logic query may ask for all people who work in a knowledge 
management project by: 

    FORALL Y, Z <- Y[worksIn ->> Z] and
                   Z : Project[hasTopic ->> KnowledgeManagement].             (3)

which may result in the tuples M_1 := (Gerd, OntoWise) and
M_2 := (Andreas, TelekomProject). Obviously, both answers are correct with regard
to the given knowledge base and ontology, but the question is what would
be a plausible ranking for the correct answers. This ranking should be produced
from a given query without assuming any modification of the query.
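The example can be replayed outside F-Logic. The following Python sketch is not the Ontobroker implementation; the tuple encoding and the function names `apply_axiom` and `query` are assumptions made purely for illustration. It encodes the facts of lines 5-10, applies the axiom of line 4 until no new hasTopic facts appear, and then evaluates the query.

```python
# Facts from the knowledge base (lines 5-10), encoded as simple tuples.
subtopic_of = {("KnowledgeDiscovery", "KnowledgeManagement")}
works_in = {("Gerd", "OntoWise"), ("Andreas", "TelekomProject")}
has_topic = {("OntoWise", "KnowledgeManagement"),
             ("TelekomProject", "KnowledgeDiscovery")}


def apply_axiom(has_topic, subtopic_of):
    """Line 4: if Z has topic X and X is a subtopic of Y, then Z has topic Y."""
    derived = set(has_topic)
    changed = True
    while changed:  # iterate until a fixpoint is reached
        changed = False
        for (z, x) in list(derived):
            for (sub, sup) in subtopic_of:
                if sub == x and (z, sup) not in derived:
                    derived.add((z, sup))
                    changed = True
    return derived


def query(works_in, has_topic, topic):
    """All (person, project) pairs whose project has the given topic."""
    return {(y, z) for (y, z) in works_in if (z, topic) in has_topic}


closed = apply_axiom(has_topic, subtopic_of)
answers = query(works_in, closed, "KnowledgeManagement")
answers_without_axiom = query(works_in, has_topic, "KnowledgeManagement")
```

Without the axiom only (Gerd, OntoWise) is returned; with it both tuples are answers, which is the situation in which a ranking becomes useful.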



4.1 Reinterpreting Queries 

Our principal consideration builds on the definition of semantic similarity that
we have first described in [?]. There, we have developed a measure for the
similarity of two knowledge bases. Here, our basic idea is to reinterpret possible
query results as a “query knowledge base” and compute its similarity to the
original knowledge base while abstracting from semantic inferences. The result
of an F-Logic query may be reinterpreted as a query knowledge base (QKB) by
the following approach.

An F-Logic query is of the form, or can be rewritten into the form (negation
requires special treatment),

\[ \text{FORALL } X \leftarrow P(X, k), \tag{4} \]

with X being a vector of variables (X_1, ..., X_m), k being a vector of constants,
and P being a vector of conjoined predicates. The result of a query is a two-dimensional
matrix M of size m × n, with n being the number of result tuples
and m being the length of X and, hence, the length of the result tuples. Hence, in
our example above X := (Y, Z), k := (KnowledgeManagement), P := (P_1, P_2),
P_1(a, b, c) := a[worksIn ->> b], P_2(a, b, c) := b[hasTopic ->> c] and

\[ M := (M_1, M_2) = \begin{pmatrix} \text{Gerd} & \text{Andreas} \\ \text{OntoWise} & \text{TelekomProject} \end{pmatrix}. \tag{5} \]

Now, we may define the query knowledge base i (QKB_i) by

\[ QKB_i := P(M_i, k). \tag{6} \]

The similarity measure between the query knowledge base and the given
knowledge base may then be computed in analogy to [?]. An adaptation and
simplification of the measures described there is given in the following together
with an example.



4.2 Similarity of Knowledge Bases 

The similarity between two objects (concepts and/or instances) may be computed
by considering their relative place in a common hierarchy H. H may, but need
not, be a taxonomy ℋ. For instance, in our example from above we have a
categorization of research topics, which is not a taxonomy!

Our principal measures are based on the cotopies of the corresponding objects
as defined by a given hierarchy H, e.g. an ISA hierarchy ℋ, a part-whole
hierarchy, or a categorization of topics. Here, we use the upwards cotopy (UC)
defined as follows:



\[ UC(O_i, H) := \{ O_j \mid H(O_i, O_j) \lor O_j = O_i \} \tag{7} \]

UC is overloaded in order to allow for a set of objects M as input instead of only 
single objects, viz. 

\[ UC(M, H) := \bigcup_{O_i \in M} \{ O_j \mid H(O_i, O_j) \lor O_j = O_i \} \tag{8} \]

Based on the definition of the upwards cotopy (UC) the object match (OM) is 
defined by: 



\[ OM(O_1, O_2, H) := \frac{|UC(O_1, H) \cap UC(O_2, H)|}{|UC(O_1, H) \cup UC(O_2, H)|}. \tag{9} \]



Basically, OM reaches 1 when two concepts coincide (the intersection and the
union of the respective upwards cotopies then have the same size); it degrades
to the extent to which the discrepancy between intersection and union increases
(an OM between concepts that do not share common superconcepts yields value 0).
Example. We here give a small example for computing UC and OM based on
a given categorization of objects H. Figure 6 depicts the example scenario. The
upwards cotopy UC(KnowledgeDiscovery, H) is given by {KnowledgeDiscovery,
KnowledgeManagement}. The upwards cotopy UC(Optimization, H) computes to
{Optimization}.




Fig. 6. Example for computing UC and OM 



Computing the object match OM between KnowledgeManagement and 
Optimization results in 0, the object match between KnowledgeDiscovery and 
CSCW computes to |. 

The match introduced above may easily be generalized to relations using a
relation hierarchy H_R. Thus, the predicate match (PM) for two n-ary predicates
P_1, P_2 is defined by a mean value. Thereby, we use the geometric mean in order
to reflect the intuition that if the similarity of one of the components approaches
0, the overall similarity between two predicates should approach 0, which need
not be the case for the arithmetic mean:



\[ PM(P_1(I_1, \ldots, I_n), P_2(J_1, \ldots, J_n)) := \sqrt[n+1]{OM(P_1, P_2, H_R) \cdot OM(I_1, J_1, H) \cdot \ldots \cdot OM(I_n, J_n, H)}. \tag{10} \]



This result may be averaged over an array of predicates. We here simply give 
the formula for our actual needs, where a query knowledge base is compared 
against a given knowledge base KB: 



\[ Simil(QKB_i, KB) = Simil(P(M_i, k), KB) := \frac{1}{|P|} \sum_{P_j \in P} \max_{Q(M_i, k) \in KB} PM(P_j(M_i, k), Q(M_i, k)). \tag{11} \]



For instance, we compare the two result tuples from our example above with the
given knowledge base. First, M_1 := (Gerd, OntoWise). Then, we have the query
knowledge base (QKB_1):

    Gerd[worksIn ->> OntoWise].
    OntoWise[hasTopic ->> KnowledgeManagement].                               (12)

and its relevant counterpart predicates in the given knowledge base (KB) are:

    Gerd[worksIn ->> OntoWise].
    OntoWise[hasTopic ->> KnowledgeManagement].                               (13)

This is a perfect fit. Therefore Simil(QKB_1, KB) computes to 1.

Second, M_2 := (Andreas, TelekomProject). Then, we have the query knowledge
base (QKB_2):

    Andreas[worksIn ->> TelekomProject].
    TelekomProject[hasTopic ->> KnowledgeManagement].                         (14)

and its relevant counterpart predicates in the given knowledge base (KB) are:

    Andreas[worksIn ->> TelekomProject].
    TelekomProject[hasTopic ->> KnowledgeDiscovery].                          (15)

Hence, the similarity of the first predicates indicates a perfect fit and evaluates
to 1, but the congruency of TelekomProject[hasTopic ->> KnowledgeManagement] with
TelekomProject[hasTopic ->> KnowledgeDiscovery] measures less than 1. The
instance match of KnowledgeDiscovery and KnowledgeManagement returns 1/2 in the
given topic hierarchy. Therefore, the predicate match returns
(1 · 1 · 1/2)^(1/3) ≈ 0.79. Thus, the overall ranking of the second result is based on
1/2 · (1 + 0.79) = 0.895.
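The worked example can be checked mechanically. The following Python sketch is a simplified reading of (10) and (11), not the authors' implementation. Predicates are encoded as (relation, arguments) pairs, a single assumed `PARENTS` map stands in for both the topic hierarchy and the relation hierarchy, objects without an entry are treated as unrelated roots, and the maximum in (11) is taken over all predicates of the knowledge base.

```python
PARENTS = {"KnowledgeDiscovery": {"KnowledgeManagement"}}  # assumed hierarchy


def upwards_cotopy(obj):
    """The object plus all of its ancestors in the assumed hierarchy."""
    cotopy, stack = {obj}, [obj]
    while stack:
        for parent in PARENTS.get(stack.pop(), set()):
            if parent not in cotopy:
                cotopy.add(parent)
                stack.append(parent)
    return cotopy


def object_match(o1, o2):
    uc1, uc2 = upwards_cotopy(o1), upwards_cotopy(o2)
    return len(uc1 & uc2) / len(uc1 | uc2)


def predicate_match(p, q):
    """Geometric mean over the relation name and the n arguments, cf. (10)."""
    (rel_p, args_p), (rel_q, args_q) = p, q
    if len(args_p) != len(args_q):
        return 0.0
    product = object_match(rel_p, rel_q)
    for a, b in zip(args_p, args_q):
        product *= object_match(a, b)
    return product ** (1.0 / (len(args_p) + 1))


def simil(qkb, kb):
    """Average over the query predicates of their best match in KB, cf. (11)."""
    return sum(max(predicate_match(p, q) for q in kb) for p in qkb) / len(qkb)


def query_kb(person, project, topic="KnowledgeManagement"):
    """QKB_i := P(M_i, k) for the example query."""
    return [("worksIn", (person, project)), ("hasTopic", (project, topic))]


KB = [
    ("worksIn", ("Gerd", "OntoWise")),
    ("hasTopic", ("OntoWise", "KnowledgeManagement")),
    ("worksIn", ("Andreas", "TelekomProject")),
    ("hasTopic", ("TelekomProject", "KnowledgeDiscovery")),
]

rank_1 = simil(query_kb("Gerd", "OntoWise"), KB)
rank_2 = simil(query_kb("Andreas", "TelekomProject"), KB)
```

Here `rank_1` is exactly 1 and `rank_2` is about 0.897. The text's 0.895 results from rounding the predicate match to 0.79 before averaging. Either way, Gerd's tuple is ranked above Andreas's.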



Remarks on semantic ranking. The reader may note some basic properties of
the ranking: (i) similarity of knowledge bases is an asymmetric measure, (ii) the
ontology defines a conceptual structure useful for defining similarity, (iii) the
core concept for evaluating semantic similarity is cotopy defined by a dedicated
hierarchy. The actual computation of similarity depends on which conceptual
structures (e.g. hierarchies like taxonomy, part-whole hierarchies, or topic hierarchies)
are selected for evaluating conceptual nearness. Thus, similarity of
knowledge bases depends on the view selected for the similarity measure.

Ranking of semantic queries using underlying ontological structures is an 
important means in order to allow users a more specific view onto the underlying 
knowledge base. The method that we propose is based on a few basic principles: 

— Reinterpret the combination of query and results as query knowledge bases
that may be compared with the explicitly given information.

— Give a measure for comparing two knowledge bases, thus allowing rankings 
of query results. 

Thus, we may improve the interface to the underlying structures without chang- 
ing the basic architecture. Of course, the reader should be aware that our measure 
may produce some rankings for results that are hardly comparable. For instance, 
results may differ slightly because of imbalances in a given hierarchy or due to 
rather random differences of depth of branches. In this case, ranking may per- 
haps produce results that are not better than unranked ones — but the results 
will not be any worse either. 
5 RDF Outside — From a Semantic Web Site to the
Semantic Web

In the preceding sections we have described the development and the underlying 
techniques of the AIFB semantic web site. Having developed the core applica- 
tion we decided that RDF-capable software agents should be able to understand 
the content of the application. Therefore, we have built an automatic RDF generator
that dynamically generates RDF statements on each of the static and
dynamic pages of the semantic knowledge portal. Our current AIFB intranet 
application is “Semantic Web-ized” using RDF facts instantiated and defined 
according to the underlying AIFB ontology. On top of this generated and for- 
mally represented metadata, there is the RDF Crawler, a tool that gathers 
interconnected fragments of RDF from the internet. 



5.1 RDF Generator — An Example 

The RDFMaker established in the Ontobroker framework (cf. [?]) was a
starting point for building the RDF Generator. The idea of RDFMaker was
that RDF statements are generated from Ontobroker’s internal database.

RDF Generator follows a similar approach and extends the principal ideas.
In a first step it generates an RDF(S)-based ontology that is stored at a specific
XML namespace, e.g. in our concrete application
http://ontobroker.semanticweb.org/ontologies/aifb-onto-2001-01-01.rdfs. Additionally, it
queries the knowledge warehouse. Data, e.g. for a person, is checked for consistency
and, if possible, completed by applying the given F-Logic rules. We
here give a short example of what type of data may be generated and stored on
a specific homepage of a researcher:



    <rdf:RDF
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:aifb="http://ontobroker.semanticweb.org/aifb-2001-01-01.rdfs#">
      <aifb:PhDStudent rdf:ID="per:ama">
        <aifb:name>Alexander Maedche</aifb:name>
        <aifb:email>ama@aifb.uni-karlsruhe.de</aifb:email>
        <aifb:phone>+49-(0)721-608 6558</aifb:phone>
        <aifb:fax>+49-(0)721-608 6580</aifb:fax>
        <aifb:homepage>http://www.aifb.uni-karlsruhe.de/WBS/ama</aifb:homepage>
        <aifb:supervisor
          rdf:resource="http://www.aifb.uni-karlsruhe.de/studer.html#per:rst"/>
      </aifb:PhDStudent>
    </rdf:RDF>



RDF Generator is a configurable tool: in some cases one may want to use
inferences to generate materialized, complete RDF descriptions on a home page;
in other cases one may want to generate only ground facts of RDF. Therefore,
RDF Generator allows axioms to be switched on and off in order to adapt the
generation of results to varying needs.
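The effect of switching axioms on and off can be illustrated with a small Python sketch based on the standard `xml.etree.ElementTree` module. This is not the RDF Generator itself. The function `generate_rdf`, its parameters, and the inferred `worksIn` property are hypothetical and serve only to show ground facts being emitted alone or together with facts assumed to have been derived by rules.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
AIFB_NS = "http://ontobroker.semanticweb.org/aifb-2001-01-01.rdfs#"


def generate_rdf(concept, identifier, ground_facts, inferred_facts=None,
                 use_axioms=False):
    """Serialize one instance; inferred facts are included only on request."""
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("aifb", AIFB_NS)
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    node = ET.SubElement(root, f"{{{AIFB_NS}}}{concept}",
                         {f"{{{RDF_NS}}}ID": identifier})
    facts = dict(ground_facts)
    if use_axioms and inferred_facts:
        facts.update(inferred_facts)  # materialize rule-derived statements
    for prop, value in facts.items():
        ET.SubElement(node, f"{{{AIFB_NS}}}{prop}").text = value
    return ET.tostring(root, encoding="unicode")


ground = {"name": "Alexander Maedche"}
inferred = {"worksIn": "OntoWise"}  # hypothetical result of an inverse-relation rule

xml_ground = generate_rdf("PhDStudent", "per:ama", ground, inferred, use_axioms=False)
xml_full = generate_rdf("PhDStudent", "per:ama", ground, inferred, use_axioms=True)
```

With `use_axioms=False` the output contains only the stated name; with `use_axioms=True` the derived statement is materialized as well.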
5.2 RDF Crawler

The RDF Crawler is a tool which downloads interconnected fragments of
RDF from the internet and builds a knowledge base from this data. Building an
external knowledge base for the whole AIFB (its researchers, its projects, its publications,
...) becomes easy using the RDF Crawler and machine-processable
RDF data currently defined on AIFB's web. We here briefly describe the underlying
techniques of our RDF Crawler and the process of building a knowledge
base. In general, RDF data may appear in Web documents in several ways. We
distinguish between pure RDF (files that have an extension like “*.rdf”), RDF
embedded in HTML and RDF embedded in XML. Our RDF Crawler uses
RDF-API, which can deal with the different embeddings of RDF described above.

One problem of crawling is the applied filtering mechanism: baseline crawlers
are typically restricted by a given depth value. Recently, several new research
works on so-called focused crawling have been published (e.g., cf. [3]). In the
approach of [3], a set of predefined documents associated with topics in a Yahoo-like
taxonomy is used to build a focused crawler. Two hypertext mining algorithms
constitute the core of that approach. A classifier evaluates the relevance of a
hypertext document with respect to the focus topics and a distiller identifies
hypertext nodes that are good access points to many relevant pages within a
few links. In contrast, our approach uses ontological background knowledge to
judge the relevance of each page. If a page is highly relevant, the crawler may
follow the links on the particular web site. If RDF data is available on a page,
we judge relevance with respect to the quantity and quality of available data
and by the existing URLs.
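The idea of letting relevance decide whether links are followed can be sketched as follows. The text does not specify the actual relevance measure, so the one used here (the share of a page's RDF properties that occur in the ontology vocabulary) is an assumption. The in-memory `PAGES` dictionary is likewise a hypothetical stand-in for the web, and no network access takes place.

```python
from collections import deque

# Hypothetical ontology vocabulary and in-memory "web".
ONTOLOGY_TERMS = {"name", "email", "worksIn", "hasTopic"}

PAGES = {
    "home": {"statements": ["name", "email", "worksIn"], "links": ["project", "misc"]},
    "project": {"statements": ["hasTopic", "name"], "links": ["misc"]},
    "misc": {"statements": ["colour"], "links": ["far"]},
    "far": {"statements": ["name"], "links": []},
}


def relevance(page):
    """Share of a page's RDF properties that are known to the ontology."""
    statements = page["statements"]
    if not statements:
        return 0.0
    return sum(s in ONTOLOGY_TERMS for s in statements) / len(statements)


def crawl(start, threshold=0.5):
    """Breadth-first crawl that only follows links found on relevant pages."""
    visited, collected = set(), []
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in visited or url not in PAGES:
            continue
        visited.add(url)
        page = PAGES[url]
        if relevance(page) >= threshold:
            collected.append(url)
            queue.extend(page["links"])
    return collected
```

In this toy graph the crawl started at "home" collects "home" and "project". The page "far" is never reached because it is only linked from the irrelevant page "misc", which shows how a relevance filter, unlike a fixed depth limit, prunes by content.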



Example: Erdoes numbers. As mentioned above, we here give a small example
of a nice application that may be easily built using RDF metadata taken from
AIFB using the RDF Crawler. The so-called Erdoes numbers have been a part
of the folklore of mathematicians throughout the world for many years.

Scientific papers are frequently published with co-authors. Based on information
about collaboration one may compute the Erdoes number (denoted PE(R))
for a researcher R. In the AIFB web site the RDF-based metadata allows for
computing estimates of Paul Erdoes numbers of AIFB members. The numbers
are defined recursively:

1. PE(R) = 0, iff R is Paul Erdoes
2. PE(R) = min{PE(R_i) + 1} else, where R_i varies over the set of all researchers
   who have collaborated with R, i.e. have written a scientific paper together.



RDF Crawler is freely available for download at
http://ontobroker.semanticweb.org/rdfcrawler.

RDF-API is freely available at http://www-db.stanford.edu/~melnik/rdf/api.html.

The interested reader may have a look at
http://www.oakland.edu/~grossman/erdoshp.html for an overall project overview.
To put this to work, we need lists of publications annotated with RDF facts.
The lists may be automatically generated by the RDF Generator. Based on
the RDF facts one may crawl relevant information into a central knowledge base
and compute these numbers from the data.
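Once co-authorship facts are in a central knowledge base, the recursive definition can be evaluated with a breadth-first search. The Python sketch below assumes the crawled data has already been reduced to plain author lists. The function names and the sample data (placeholder researchers R1 to R4) are hypothetical.

```python
from collections import deque


def coauthor_graph(papers):
    """Build a symmetric co-authorship graph from lists of paper authors."""
    graph = {}
    for authors in papers:
        for a in authors:
            graph.setdefault(a, set()).update(x for x in authors if x != a)
    return graph


def erdoes_numbers(coauthors, root="Paul Erdoes"):
    """Breadth-first search over the co-authorship graph.

    BFS reaches researchers in order of increasing distance, so the first
    number assigned to R equals min{PE(R_i) + 1} over R's co-authors R_i.
    """
    numbers = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for other in coauthors.get(current, ()):
            if other not in numbers:
                numbers[other] = numbers[current] + 1
                queue.append(other)
    return numbers


# Purely illustrative author lists; R1..R4 are placeholders.
papers = [["Paul Erdoes", "R1"], ["R1", "R2", "R3"], ["R3", "R4"], ["R2", "R4"]]
numbers = erdoes_numbers(coauthor_graph(papers))
```

Researchers with no chain of collaborations to Paul Erdoes do not appear in the result. Since a crawled knowledge base is necessarily incomplete, the values obtained this way are upper bounds, in line with the "estimates" mentioned above.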

6 Related Work 

This section positions our work in the context of existing web portals and also 
relates our work to other basic methods and tools that are or could be deployed 
for the construction of community web portals, especially to related work in the 
area of semantic ranking of query results. 

Related Work on Knowledge Portals. One of the well-established web portals
on the web is Yahoo. In contrast to our approach, Yahoo only utilizes a very
light-weight ontology that solely consists of categories arranged in a hierarchical
manner. Yahoo offers keyword search (local to a selected topic or global) in addition
to hierarchical navigation, but is only able to retrieve complete documents,
i.e. it is not able to answer queries concerning the contents of documents, let alone
to combine facts found in different documents or to include facts
that could be derived through ontological axioms. Personalization is limited to
check-box personalization. We avoid these shortcomings since our portal
is built upon a rich ontology enabling the portal to give integrated answers to
queries. Furthermore, our semantic personalization features provide more flexible
means for adapting the portal to the specific needs of its users.

A portal that is specialized for a scientific community has been built by the
Math-Net project. At http://www.math-net.de/ the portal for the (German)
mathematics community is installed that makes distributed information from
several mathematical departments available. This information is accompanied
by meta-data according to the Dublin Core [?] standard. The Dublin Core
element “Subject” is used to classify resources as conferences, as research groups,
as preprints etc. A finer classification (e.g. via attributes) is not possible except
for instances of the publication category. Here the common MSC classification
is used that resembles a light-weight ontology of the field of mathematics. With
respect to our approach Math-Net lacks a rich ontology that could enhance the
quality of search results (esp. via inferencing), and the smooth connection to the
Semantic Web world that is provided by our RDF generator.

The Ontobroker project [?] lays the technological foundations for the AIFB
portal. On top of Ontobroker the portal has been built and organizational struc- 
tures for developing and maintaining it have been established. Therefore, we 
compare our system against approaches that are similar to Ontobroker. 

The approach closest to Ontobroker is SHOE [7]. In SHOE, HTML pages
are annotated via ontologies to support information retrieval based on semantic
information. Besides the use of ontologies and the annotation of web pages the
underlying philosophy of both systems differs significantly: SHOE uses description
logic as its basic representation formalism, but it offers only very limited
inferencing capabilities. Ontobroker relies on Frame-Logic and supports complex
inferencing for query answering. Furthermore, the SHOE search tool provides
neither a semantic ranking of query results nor a semantic
personalization feature. A more detailed comparison to other portal approaches
and underlying methods may be found in [?].

Yahoo: http://www.yahoo.com

Dublin Core: http://www.purl.org/dc

MSC: cf. Mathematical Subject Classification; http://www.ams.org/msc/



Related Work on Semantic Similarity. Since our semantic ranking is based
on the comparison of the query knowledge base with the given ontology and
knowledge base, we relate our work to the comparison of ontological structures
and knowledge bases (covering the same domain) and to measuring the similarity
between concepts in a hierarchy. Although there has been a long discussion
in the literature about evaluating knowledge bases [?], we have not found any
discussion about comparing two knowledge bases covering the same domain that
corresponds to our semantic ranking approach. Similarity measures for ontological
structures have been investigated in areas like cognitive science, databases or
knowledge engineering (cf., e.g., [?]). However, all these approaches are
restricted to similarity measures between lexical entries, concepts, and template
slots within one ontology.

Closest to our measure of similarity is work in the NLP community, named
semantic similarity [?], which refers to similarity between two concepts in an isA-taxonomy
such as the WordNet or CYC upper ontology. Our approach differs in
two main aspects from this notion of similarity: Firstly, our similarity measure is
applicable to a hierarchy which may, but need not, be a taxonomy, and secondly
it takes into account not only commonalities but also differences between
the items being compared, expressing both in semantic-cotopy terms. This second
property enables the measuring of self-similarity and subclass-relationship
similarity, which are crucial for comparing results derived from the inferencing
processes that are executed in the background.

Conceptually, instead of measuring similarity between isolated terms (words),
which does not take into account the relationships among word senses that matter,
we measure similarity between “words in context”, by measuring similarity between
Object-Attribute-Value pairs, where each term corresponds to a concept
in the ontology. This enables us to exploit the ontological background knowledge
(axioms and relations between concepts) in measuring the similarity, which
expands our approach to a methodology for comparing knowledge bases.

From our point of view, our community portal system is rather unique with 
respect to the collection of methods used and the functionality provided. We 
have extended our community portal approach that provides flexible means for
providing, integrating and accessing information [?] by semantic personalization
features, semantic ranking of generated answers and a smooth integration with 
the evolving Semantic Web. All these methods are integrated into one uniform 
system environment, the SEAL framework. 
7 Conclusion

In this paper we have shown our comprehensive approach SEAL for building
semantic portals. In particular, we have focused on three issues.

First, we have considered the ontological foundation of SEAL. There, we have
found that there are many big open issues that have hardly been
dealt with so far. In particular, the step of formalizing the ontology raises very
fundamental problems. The issue of where to put relevant concepts, viz. into the
ontology vs. into the knowledge base, is an important one that deeply affects
organization and application. However, so far there exist no corresponding methodological
guidelines to base the decision upon. For instance, we have given
the example ontology and knowledge base in (1) and (2). Using description logics
terminology, we have equated the ontology with the “T-Box” and we have put
the topic hierarchy into the knowledge base (“A-Box”). An alternative could have
been to formalize the topic hierarchy as an isA-hierarchy (which, however, it is not)
and put it into the T-Box. We believe that both alternatives exhibit an internal
fault, viz. the ontology should not be equated with the T-Box; rather,
its scope should be independent of an actual formalization with particular logical
statements. Its scope should to a large extent depend on soft issues, like “Who
updates a concept?” and “How often does a concept change?”, as already
indicated in Table [?].

Second, we have described the general architecture of the
SEAL approach, which is also used for our real-world case study, the AIFB web
site. The architecture integrates a number of components that we have also used
in other applications, like Ontobroker, navigation or query module.

Third, we
have extended our semantic modules to include a larger diversity of intelligent
means for accessing the web site, viz. semantic ranking and machine access by
crawling.

For the future, we see a number of new important topics appearing on the
horizon. For instance, we consider approaches for ontology learning [?] in order
to semi-automatically adapt to changes in the world and to facilitate the
engineering of ontologies. Currently, we work on providing intelligent means
for providing semantic information, i.e. we elaborate on a semantic annotation
framework that balances between manual provisioning from legacy texts
(e.g. web pages) and information extraction [?]. Given a particular conceptualization,
we envision that one wants to be able to use a multitude of different
inference engines taking advantage of different inferencing capabilities (temporal,
non-monotonic, high scalability, etc.). Then, however, one needs means to
change from one representation paradigm to the next [?].

Finally, we envision that once semantic web sites are widely available, their
automatic exploitation may be brought to new levels. Semantic web mining
considers the mining of web site structures, web site content, and web site
usage on a semantic rather than a syntactic level, yielding new possibilities,
e.g. for intelligent navigation, personalization, or summarization, to name but a
few objectives for semantic web sites [?].
Abstract. The relentless advance of Moore’s Law for both processor and
memory chips will continue to transform both the academic and commercial 
world for some years to come. Multi-Teraflop parallel computers and Petabyte 
databases will increasingly become the norm for frontier research problems. 
Even more startling has been the growth in bandwidth so that a ‘Trans-Atlantic 
Terabit Testbed’ is now under discussion based on WDM optical fibre 
technology and Erbium Doped Fibre Amplifiers. The scale of these enterprises
will also mean that such large science problems are likely to become 
increasingly international and multidisciplinary. This is the vision of ‘e- 
Science’ proposed by John Taylor, Director General of the UK Office of 
Science and Technology. To support such global e-Science collaborations a 
more general infrastructure is required. Not only do researchers need to access 
html web pages for information but also they need transparent remote access to 
computing resources, data repositories and specialist experimental facilities. 
The Grid of Foster and Kesselman aims to provide such an environment. The 
Grid is an ambitious project that has immense momentum and support worldwide.
In the UK, the OST and the DTI are investing over £120M in helping
make this Grid vision become a reality. It is vital that sensible and secure lower 
level infrastructure standards are agreed upon and high-quality Open 
Source/Open Standard middleware delivered as soon as possible. Once these 
are in place there will be great scope for innovative research into new tools to 
support data curation, information management and knowledge discovery. 
Making dependable and secure Grid software to manage and exploit the 
intrinsically dynamic and heterogeneous Grid environment is an exciting 
challenge for many sections of the UK academic Computer Science community. 
Abstract. Extended Ephemeral Logging (XEL) is a database logging
and recovery technique which manages a log of recovery data by par- 
titioning it into a series of logically circular generations. XEL copies 
longer-lived log data from one generation to another in order to reclaim 
more quickly the space occupied by shorter-lived log data. As a result 
of copying, records in the log lose their original ordering; this leads to 
main-memory and log space overhead for obsolete recovery data. In this 
paper, we quantify the effects of reordering log records by introducing 
the notion of Garbage Removal Dependencies (GRDs). We develop a 
classification of log records based on GRDs and use it to characterize 
main-memory and log space allocation during normal system operation. 
Through simulation, we demonstrate how main-memory and log space 
allocation vary with changes in database and workload parameters. 



1 Introduction 

Database systems use data written to a nonvolatile log to recover from transaction,
system, and media failures [5]. Logged data that is used to recover from
transaction and system failures is ephemeral; that is, it only needs to be retained 
until the updates of committed transactions are propagated to the database and 
the updates of aborted transactions are removed from the database. When the 
database becomes current with respect to a given log record, that log record 
is obsolete and is no longer required for recovery. In order to reduce recovery 
time — and in particular, system recovery time — it is necessary to manage 
the log during normal system operation so that obsolete log records are removed 
from the log or are recorded as being obsolete. The goal is to do this with as 
little impact as possible to transaction processing performance. 

Ideally, from the point of view of recovery, log records should be removed 
from the log as soon as they become obsolete. However, this is not possible, 
since in-place updates to the log are prohibited. This means that at any given 
time, the log will contain some number of obsolete log records. Obsolete log 
records increase the size of the log and may result in additional overhead during 
normal system operation and system recovery. For example, system recovery
processing may spend time trying to determine that a given log record is obsolete 
(by inspecting some portion of the log, database, or both) or it may re-apply a 
change that is already present in the database. 

To reduce overhead, steps must be taken to limit the number of obsolete 
records residing in the log. Since log records can’t be removed individually from 
arbitrary places within the log, they are removed periodically in contiguous 
groups from the beginning of the log — through a process known as truncation. 
The log can be truncated only up to its oldest “live” log record; thus, the rate 
of truncation depends on the lifetimes of log records. 

Although log records are ephemeral, some may be longer-lived than others. 
The lifetime of a log record depends on such factors as transaction duration, the 
rate at which individual objects are updated, and the rate at which updates are 
flushed to the database. For example, the “after image” of a frequently updated 
object is short-lived, whereas the “before image” of an object updated by a long 
running transaction is longer-lived. 

Through the course of normal transaction processing activity, both short- 
lived and long-lived log records will become intermingled throughout the log; 
thus, contiguous segments of obsolete data will not form “naturally.” A longer- 
lived log record will prevent the truncation of obsolete log records that follow 
it, thus increasing the number of obsolete records that are retained in the log. 
Therefore, explicit actions are required to deal with long-lived log records so that 
they do not cause the amount of log to be recovered to grow without bound. 
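The blocking effect of a long-lived record on truncation can be seen in a few lines of Python. This is a toy model, not part of any logging system. The record layout with `lsn` and `live` fields and the function names are assumptions for illustration.

```python
def truncation_point(log):
    """Index of the oldest live record; everything before it can be truncated."""
    for index, record in enumerate(log):
        if record["live"]:
            return index
    return len(log)  # no live record left: the whole log can be truncated


def truncate(log):
    """Remove the contiguous obsolete prefix of the log."""
    return log[truncation_point(log):]


# Hypothetical log contents; LSNs and liveness flags are made up.
log = [
    {"lsn": 1, "live": False},
    {"lsn": 2, "live": False},
    {"lsn": 3, "live": True},   # long-lived record blocks further truncation
    {"lsn": 4, "live": False},
    {"lsn": 5, "live": True},
]
remaining = truncate(log)
```

After truncation, the obsolete record with LSN 4 is still retained because it follows the live record with LSN 3. This is the intermingling effect described above.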

One way to handle the effects of long-lived log records is to record additional 
information during normal system operation which allows recovery processing 
to ignore or bypass obsolete log records. For example, in the ARIES logging and 
recovery method [?], information useful in determining which log records are
obsolete is periodically written to the log in an operation called a checkpoint. 
The location of the most recent checkpoint information in the log is recorded 
durably outside of the log itself so that the checkpoint can be located quickly 
upon system recovery. In addition, log records for a given transaction are chained 
together (each new log record is written with a “pointer” to the prior record) so 
that live log records for a transaction can be located directly without scanning 
the log. Although the log space may be longer than it need be due to linger- 
ing obsolete log records, checkpointing and log record chaining can reduce the 
amount of recovery time spent processing obsolete log records. 

Another way to deal with long-lived log records is to force them to become 
obsolete so that larger contiguous segments of the log can be truncated. The 
most intrusive options include flushing a change to the database just to make 
REDO information obsolete or aborting a long running transaction and undoing 
its updates just to make UNDO information obsolete. A less intrusive option is 
to copy log records from the beginning of the log (the head) to the end of the 
log (the tail) to make their original copies obsolete. 

An approach based on copying log records is described in P]. In this method, 
the physical log space is managed as logically circular. Logical head and tail 
pointers are maintained which track the oldest and newest log records, respec- 
tively. New log records are written to the tail, and whenever the tail gets within 
a certain threshold of the head, live log records at the head are copied to the 
tail^ (whenever the head and tail pointers reach the physical end of the log, 
they are advanced to the physical beginning of the log). Copying live log records 
from the head allows the log to be truncated, that is, overwritten with new 
log data. By truncating the log on a regular basis, overall log size is reduced 
by reducing the amount of time shorter-lived log records reside in the log. A 
potential drawback to this approach is that long-lived log records may need to 
be copied repeatedly. 
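
As an illustration only, the copying scheme above can be sketched in a few lines of Python. The class and parameter names (`CircularLog`, `capacity`, `threshold`, `is_live`) are assumptions made for this sketch and are not taken from the cited method; log space is counted in record slots rather than pages, and liveness is supplied by a caller-provided callback.

```python
from collections import deque


class CircularLog:
    """Toy model of a logically circular log (hypothetical, for illustration)."""

    def __init__(self, capacity, threshold, is_live):
        self.capacity = capacity      # total number of record slots
        self.threshold = threshold    # reclaim when free slots drop to this level
        self.is_live = is_live        # callback: is the record still needed?
        self.records = deque()        # left end = head (oldest), right end = tail
        self.copies = 0               # number of times a live record was re-copied

    def append(self, record):
        """Write a new record at the tail, reclaiming head space first."""
        self._reclaim()
        if len(self.records) >= self.capacity:
            raise RuntimeError("log full: all remaining records are live")
        self.records.append(record)

    def _reclaim(self):
        budget = len(self.records)    # examine each record at most once per call
        while budget > 0 and self.capacity - len(self.records) <= self.threshold:
            budget -= 1
            record = self.records.popleft()      # truncate at the head
            if self.is_live(record):
                self.records.append(record)      # copy the live record to the tail
                self.copies += 1


# One long-lived record ("long") among short-lived ones.
log = CircularLog(capacity=4, threshold=1, is_live=lambda r: r == "long")
for r in ["long", "a", "b", "c", "d", "e"]:
    log.append(r)
```

In this run the long-lived record is copied each time it reaches the head, which is the drawback noted above: the shorter-lived records are reclaimed for free, while the long-lived one is paid for repeatedly.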

Another logging method based on copying log records is Extended Ephem- 
eral Logging (XEL) 0. To reclaim obsolete log space more aggressively and 
make the log even smaller, XEL manages the log as a series of logically circular 
generations. XEL reclaims log space at a rate commensurate with log record 
lifetimes; space occupied by shorter-lived log records is reclaimed more quickly 
than space occupied by longer-lived log records. By successively copying longer- 
lived log records from one generation to the next and truncating each generation 
separately, XEL attempts to reduce log size while reducing the number of times 
longer-lived log records must be copied. By keeping the log small, the log 
space can be recovered in its entirety; thus, there is no need for techniques 
such as checkpointing and log record chaining to help bypass obsolete log records. 

In this paper, we study the effects that reordering log records has on an 
XEL-managed log. We introduce the notion of Garbage Removal Dependencies 
(GRDs) to explain why XEL incurs main-memory and log space overhead for 
obsolete recovery data. We develop a classification of log records based on 
GRDs and use it to characterize main-memory and log space allocation during 
normal system operation. Through simulation on an XEL implementation we 
have developed, we demonstrate how main-memory and log space allocation 
vary with changes in database and workload parameters. Our main result is 
showing that main-memory overhead for obsolete recovery data is proportional 
to the amount of log space not occupied by live recovery data. 



1.1 Database System Architecture 

We are evaluating XEL in the context of disk-resident database systems con- 
sisting of pages of objects. Recently accessed pages are cached in (volatile) 
main-memory and managed by the cache manager. Transactions update ob- 
jects in-place; that is, directly in the cache. Objects are accessed by obtaining 
object locks in accordance with the strict two-phase locking (strict 2PL) pro- 
tocol. For each individual access to a locked object, the page latch m for the 
object’s page is held. We say that an updated cache object (and its containing 
page) is dirty^ if the transaction that updated it is still active (we say the 
object is clean when the transaction completes). Updated cache pages are even- 
tually made persistent when they are flushed to the disk-resident database (as 
a result of demand paging). Pages can be flushed regardless of whether they are 
dirty or clean (i.e., the STEAL/NO-FORCE policy is used [2]). 

^ Only the log records of active transactions ever need be copied in this method; log 
records of completed transactions are guaranteed to have had their corresponding 
changes flushed to the database by the time the head is reached. 

A disk-resident log managed by the log manager captures log records which 
describe recent updates to the database. Log records are written to a main- 
memory buffer (one buffer per generation) before eventually being forced to disk. 
Each generation of the log (and its associated main-memory data structures) is 
serialized by its own log lock p. Logged information is used by the recovery 
manager in the event of failure to restore the database to a state that reflects 
only (and all) committed updates. 



2 XEL 

XEL is a logging and recovery method that protects a database against transac- 
tion and system failures (an integrated media recovery scheme is not supported). 
It manages the log for databases consisting of objects which need not contain se- 
quence numbers (i.e., timestamps). It is designed to be most effective for database 
workloads generating some long-lived but mostly short-lived log records. 

XEL partitions a log space statically into n fixed-size generations, with gen- 
erations allocated across one or more disks (assuming a disk-based log is used). 
A log record is initially written to the first generation (G_0) and is copied to sub- 
sequent generations as required. If the log record is short-lived, it should become 
obsolete during its stay in G_0; if so, the space it occupies is reclaimed automat- 
ically when G_0 is truncated. If the log record remains live when its portion of 
G_0 is due for truncation, it is copied to G_1 (forwarded) to make its original 
copy obsolete. Subsequently, if the log record remains live when its portion of 
G_1 is due for truncation, it is copied to G_2. This is repeated as necessary until 
the log record becomes obsolete or reaches the last generation (G_{n-1}); if the 
log record reaches G_{n-1}, it is copied within G_{n-1} (recirculated) as often as 
necessary until it becomes obsolete. If the entire log space is allocated to only a 
single generation (which XEL allows), live log records are copied within G_0. 
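
The forwarding and recirculation steps just described can be sketched as follows. This is a simplified model under our own assumptions, not the XEL implementation: generations are sized in records rather than pages, truncation happens one record at a time when a generation is full, and the names `GenerationalLog`, `sizes`, and `is_live` are hypothetical.

```python
from collections import deque


class GenerationalLog:
    """Toy model of a log split into fixed-size generations (illustrative)."""

    def __init__(self, sizes, is_live):
        self.sizes = list(sizes)                  # capacity of each generation, in records
        self.gens = [deque() for _ in self.sizes]
        self.is_live = is_live                    # callback: is the record still needed?
        self.forwarded = 0
        self.recirculated = 0

    def write(self, record):
        """New log records always enter the first generation."""
        self._place(0, record)

    def _place(self, g, record):
        self._make_room(g)
        self.gens[g].append(record)

    def _make_room(self, g):
        gen, last = self.gens[g], len(self.gens) - 1
        budget = len(gen)                         # look at each record at most once
        while len(gen) >= self.sizes[g]:
            if budget == 0:
                raise RuntimeError(f"generation {g} holds only live records")
            budget -= 1
            record = gen.popleft()                # truncate at the head
            if not self.is_live(record):
                continue                          # obsolete: its space is reclaimed
            if g < last:
                self.forwarded += 1
                self._place(g + 1, record)        # forward to the next generation
            else:
                self.recirculated += 1
                gen.append(record)                # recirculate within the last one


live = {"x", "y", "z"}
log = GenerationalLog([1, 2], lambda r: r in live)
for r in ("x", "y", "z"):
    log.write(r)
live.discard("y")        # y becomes obsolete while it sits in the last generation
log.write("w")
```

In the usage above, the final write forwards `z`, which forces the last generation to make room: `x` is still live and is recirculated, while `y` is obsolete and its slot is simply reused.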

XEL logs three types of log records: REDO Data Log Records (REDO 
DLRs), UNDO Data Log Records (UNDO DLRs), and Commit Transaction 
Log Records (COMMIT TLRs). Logging is done physically; in particular, en- 
tire object images are logged. A REDO DLR contains the after image (AFIM) 
of an object, and is written to the log whenever an object is updated. All REDO 
DLRs for a transaction are logged before the transaction’s COMMIT TLR is 
logged (the force-log-at-commit rule P). An UNDO DLR contains the before 
image (BFIM) of an object, and is written to the log whenever an uncommitted 
change to an object is about to be written to the database (the latter being the 
write-ahead logging (WAL) protocol P). A single transaction may log multiple 
REDO DLRs for a given object; however, only a single UNDO DLR (maximum) 
may be logged for a given object for a given transaction. Since XEL copies log 
records around in the log, each DLR contains an object-level timestamp so that 
DLRs can be put in their correct order during system recovery. 

^ Unfortunately, the term dirty has two different meanings in the recovery literature. 
One definition (e.g., m) defines cached data as dirty if it has been changed but not 
flushed to disk. Another definition (e.g., P) defines cached or disk-resident data as 
dirty if it contains the updates of an uncommitted transaction. We adopt the latter 
definition in this paper. 

XEL maintains a main-memory resident (i.e., volatile) “directory” to track 
the contents of the log. This directory consists of two main components: the 
Logged Object Table (LOT), which tracks DLRs by object, and the Logged 
Transaction Table (LTT), which tracks TLRs by transaction. An entry exists 
in the directory for each live log record and any log record otherwise relevant to 
proper recovery of the database. Each entry points to the location of a log record 
on disk. The directory is updated continuously as log records are written, made 
obsolete, or removed from the log. The directory itself is not recorded durably 
since a “semantically equivalent” version of it can be reconstructed from the log 
in the event of a system crash. 



3 Lifetime of Log Data 

When we refer to the lifetime of logged recovery data we must consider both 
the logical lifetime and physical lifetime of log records. The logical lifetime 
of a log record is the span of time during which the log record is required for 
recovery of the database. A log record becomes logically live when it is written 
to the log and becomes logically dead when it is no longer needed for recovery. 
The logical lifetime of a log record is determined by transaction activity, cache 
management, and as we’ll see, log management. The physical lifetime of a log 
record is the span of time during which the log record resides in the log. A 
log record becomes physically live when it is written to the log and becomes 
physically dead when it is removed from the log. The physical lifetime of a log 
record is determined by its logical lifetime and the rate at which logically dead 
log records are removed from the log. 

Taking both the logical and physical lifetimes of log records into account, 
we classify a log record’s lifetime into three phases: live, garbage (heretofore 
referred to as obsolete), and dead. A log record is live when it is both logically 
and physically live, garbage when it is logically dead but physically live, and 
dead when it is both logically and physically dead. Garbage log records exist 
because log records can be removed only through truncation. A log record 
is garbage when it is no longer needed for recovery but is present in the log 
nonetheless. 

3.1 Ideal Logical Lifetimes 

We first consider the ideal logical lifetimes of log records; that is, logical life- 
times when only transaction and cache management activity is considered (the 
effects of log management, such as copying live log records and removing 
garbage log records, are ignored for now). Under these conditions, live log 
records become garbage when transactions commit or abort, when objects are 
flushed to the database, and when new log records supersede them. For exam- 
ple, a REDO DLR written by a transaction becomes garbage when the update 
it describes is flushed to the database, an UNDO DLR written by a transaction 
becomes garbage when the transaction commits, and a COMMIT TLR written 
by a transaction becomes garbage when the last REDO DLR for the transaction 
becomes garbage. 

Before we can summarize the types of live log records, we need to introduce 
some definitions. We say that a REDO DLR is dirty if the transaction that 
wrote it is still active, and we say that a REDO DLR is clean if the transaction 
that wrote it has committed. When an object is written to the disk-resident copy 
of the database, we say that it is flushed; we say that the object is unflushed 
otherwise (similarly, we say that a DLR is flushed when its corresponding object 
has been written to the database and we say that it is unflushed otherwise). 
A DLR (flushed or unflushed) containing the most recent image of an object is 
said to be current; a DLR (flushed or unflushed) containing a prior image of 
an object is said to be stale. 

Given these definitions, we can describe concisely the four types of live log 
records that may exist: 

— Live clean REDO. This is the current, unflushed REDO DLR for a given 
object, written by a transaction that has committed. 

— Live dirty REDO. This is the current, unflushed REDO DLR for a given 
object, written by a transaction that is still active. 

— Live UNDO. This is the current, unflushed UNDO DLR for a given object, 
written by a transaction that is either still active or has aborted. 

— Live COMMIT. This is the COMMIT TLR of a transaction for which at 
least one live clean REDO exists. 

Stale and flushed DLRs are considered garbage, as are UNDO DLRs written 
by committed transactions and REDO DLRs written by aborted transactions. 
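
The classification above amounts to a small decision procedure, sketched here. The function names and the string encodings of record kinds and transaction states are assumptions made for this example; only the rules themselves come from the definitions in this section.

```python
def classify_dlr(kind, tx_state, current, flushed):
    """Classify a DLR under the ideal logical lifetimes.

    kind     -- "REDO" or "UNDO"
    tx_state -- "active", "committed", or "aborted" (state of the writing transaction)
    current  -- True if the DLR holds the most recent image of its object
    flushed  -- True if the corresponding object has been written to the database
    """
    if kind not in ("REDO", "UNDO"):
        raise ValueError("kind must be 'REDO' or 'UNDO'")
    if not current or flushed:
        return "garbage"                      # stale and flushed DLRs are garbage
    if kind == "REDO":
        if tx_state == "committed":
            return "live clean REDO"
        if tx_state == "active":
            return "live dirty REDO"
        return "garbage"                      # REDO written by an aborted transaction
    if tx_state in ("active", "aborted"):
        return "live UNDO"
    return "garbage"                          # UNDO written by a committed transaction


def commit_tlr_is_live(dlr_classes):
    """A COMMIT TLR is live while at least one live clean REDO exists for it."""
    return any(c == "live clean REDO" for c in dlr_classes)
```

For instance, `classify_dlr("UNDO", "committed", True, False)` yields `"garbage"`, matching the rule that a transaction's UNDO information is no longer needed once it commits.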

3.2 Effects of Garbage Log Records 

Garbage log records become an issue when system recovery is considered. While 
the system is operating normally, live log records can be identified by consulting 
the main-memory log directory. However, since the directory is lost after a system 
crash, the recovery manager may not be able to distinguish between live and 
garbage log records based on logged information alone. Therefore, during system 
recovery, the recovery manager may have no choice but to assume that certain 
garbage log records are live. In this case, we say that these log records are 
recycled. Recycled log records, although they may waste recovery resources, 
will not hinder the ability to recover the database as long as means exist to deal 
with them. 

The ability to recover the database in the presence of recycled log records 
is due in part to the idempotence of physical REDO and UNDO. For example, 
suppose transaction t_1 updates object o_1 and commits. Suppose then that o_1 is 
flushed to the database. At this point, there are no live log records, although 
there is a garbage REDO and garbage COMMIT in the log. If the system crashes 
and then recovers, these two log records will be recycled. Because objects do not 
contain timestamps, what was a current flushed clean REDO with respect to 
normal system operation is now considered a current unflushed clean REDO — 
a live log record. Accordingly, the garbage COMMIT is now considered a live 
COMMIT. As a result, the database is restored with the image found in the 
REDO log record. This causes no problem, however, since restoring the database 
with an object image is an idempotent operation. 

The ability to recover the database in the presence of recycled log records 
is also dependent on the removal (truncation) of garbage log records in their 
correct order. For correct recovery, the following Garbage Removal Rules 
(GRRs) must be obeyed: 

GRR1: Remove stale DLRs before current DLRs. If the log contains a 
current flushed DLR and one or more stale versions of it, the stale DLR(s) must 
be removed from the log before the current flushed DLR. 

GRR2: Remove UNDO DLRs before COMMIT TLRs. If the log 
contains a garbage COMMIT TLR and one or more garbage UNDO DLRs 
written by the committed transaction, the UNDO DLR(s) must be removed 
before the COMMIT TLR. 

GRR1 is a consequence of database objects not containing timestamps. 
Without timestamps in objects, the recovery manager has no way of knowing 
whether a logged image is the current image or not; the best it can do is apply 
the most recent clean DLR (determined based on DLR timestamps) to the 
database. GRR2 is a consequence of the presumed abort protocol p. Without a 
COMMIT TLR in the log, the recovery manager would assume incorrectly that 
the transaction aborted; as such, it would apply the committed transaction’s 
UNDO DLRs to the database. 
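
A small sketch makes the need for GRR1 concrete. It is a deliberate simplification and not XEL's recovery algorithm: it models only the REDO DLRs of a single object as `(timestamp, transaction, image)` tuples, and it takes `committed` to be the set of transactions whose COMMIT TLRs were found in the log. All names are hypothetical.

```python
def recovered_image(db_image, redo_dlrs, committed):
    """Return the image of one object after a simplified recovery pass.

    db_image  -- image currently stored in the disk-resident database
    redo_dlrs -- REDO DLRs found in the log, as (timestamp, tx, image) tuples
    committed -- transactions whose COMMIT TLR is present in the log
    """
    # Presumed abort: a REDO DLR counts as clean only if its COMMIT TLR exists.
    clean = [d for d in redo_dlrs if d[1] in committed]
    if not clean:
        return db_image
    # Objects carry no timestamps, so the most recent clean DLR is applied blindly.
    return max(clean, key=lambda d: d[0])[2]


stale = (1, "t1", "v1")       # superseded image
current = (2, "t2", "v2")     # most recent image, already flushed to the database
committed = {"t1", "t2"}

# Both garbage records are recycled: restoring "v2" again is harmless.
both = recovered_image("v2", [stale, current], committed)
# GRR1 obeyed (stale DLR removed first): the database keeps "v2".
stale_removed = recovered_image("v2", [current], committed)
# GRR1 violated (current DLR removed first): the stale image overwrites "v2".
current_removed = recovered_image("v2", [stale], committed)
```

Only the last case corrupts the object, which is why the stale DLR has to leave the log before the current flushed one.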

To enforce the GRRs, dependencies between individual garbage log records 
must be taken into account. We say that garbage log record L_2 depends on 
garbage log record L_1 if L_2 can be removed only after L_1 is removed. In this 
case, we write L_1 → L_2, which reads “remove L_1 before L_2.” Based on the GRRs, 
we define the following Garbage Removal Dependencies (GRDs): 

GRD1: Stale DLR → Current DLR. A stale DLR must be removed before 
its corresponding current flushed DLR. (This follows directly from GRR1.) 

GRD2: Current REDO DLR → COMMIT TLR. A current flushed REDO 
DLR participating in a GRD1 must be removed before its associated garbage 
COMMIT TLR. To illustrate, suppose L_1 is a stale DLR, L_2 is a current flushed 
REDO DLR, L_1 → L_2 is a GRD1, and L_3 is the garbage COMMIT TLR asso- 
ciated with L_2. This means that a GRD2 L_2 → L_3 exists. (This is a “cascaded” 
dependency following from GRR1.) 

GRD3: UNDO DLR → COMMIT TLR. A garbage UNDO DLR must be 
removed before its associated garbage COMMIT TLR. (This follows directly 
from GRR2.) 

Given GRDs, garbage removal becomes another “event” upon which the 
logical lifetimes of log records depend. 
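
The three dependencies can be derived mechanically from a set of garbage log records, as the following sketch shows. The record layout (`Rec`) and both function names are assumptions introduced for illustration; the sketch covers only the three rules stated above and ignores everything else about the log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rec:
    name: str
    kind: str                   # "REDO", "UNDO", or "COMMIT"
    tx: str
    obj: Optional[str] = None   # object id (DLRs only)
    current: bool = False       # DLR holds the most recent image of obj


def garbage_removal_dependencies(garbage):
    """Return (rule, remove_first, remove_after) triples for garbage records."""
    commits = {r.tx: r for r in garbage if r.kind == "COMMIT"}
    deps = []
    for cur in garbage:
        if cur.kind == "COMMIT" or not cur.current:
            continue
        stale = [r for r in garbage
                 if r.kind != "COMMIT" and r.obj == cur.obj and not r.current]
        for s in stale:
            deps.append(("GRD1", s.name, cur.name))
        # GRD2 cascades from GRD1: only a current REDO that has stale versions.
        if stale and cur.kind == "REDO" and cur.tx in commits:
            deps.append(("GRD2", cur.name, commits[cur.tx].name))
    for r in garbage:
        if r.kind == "UNDO" and r.tx in commits:
            deps.append(("GRD3", r.name, commits[r.tx].name))
    return deps


def respects(removal_order, deps):
    """True if every dependency's first record is removed before its second."""
    pos = {name: i for i, name in enumerate(removal_order)}
    return all(pos[first] < pos[after] for _, first, after in deps)


# The illustration from the text: L1 stale, L2 current flushed REDO, L3 its COMMIT.
L1 = Rec("L1", "REDO", "t1", "o1", current=False)
L2 = Rec("L2", "REDO", "t2", "o1", current=True)
L3 = Rec("L3", "COMMIT", "t2")
deps = garbage_removal_dependencies([L1, L2, L3])
```

Here `deps` contains the GRD1 L1 → L2 and the cascaded GRD2 L2 → L3, so removing the records in the order L1, L2, L3 is acceptable while L2, L1, L3 is not.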



3.3 Lifetimes in an Ordered Log 

An ordered log is a log in which log records retain their original ordering; 
that is, a log in which log records are never copied. In an ordered log, GRDs 
are satisfied implicitly. An object’s stale DLRs will always precede^ its current 
DLRs, a committed transaction’s current REDO DLRs will always precede its 
COMMIT TLR, and a committed transaction’s UNDO DLRs will always precede 
its COMMIT TLR (assuming the flushing of UNDO information is synchronized 
with the commit of its corresponding transaction). Since dependent log records 
always follow the log records on which they depend, garbage log records will be 
removed in their correct order. This is because truncation always proceeds from 
head to tail. As such, the presence of garbage log records does not affect the 
logical lifetimes of log records in an ordered log. Therefore, the logical lifetimes 
are the same as the ideal logical lifetimes described in Sect. 3.1. 



3.4 Lifetimes in an Unordered Log 

An unordered log [3] is a log in which log records are copied. In an unordered 
log, GRDs are not satisfied implicitly; GRRs need to be enforced with explicit 
GRDs. An explicit GRD is needed when a garbage log record precedes a garbage 
log record on which it depends. 

The presence of explicit GRDs has negative implications regarding log man- 
agement. One problem is that log records that would otherwise be garbage are 
kept live, thus extending their logical lifetimes. For example, if a committed 
transaction’s COMMIT TLR appears before one of its UNDO DLRs, the COM- 
MIT TLR must be treated as live, even if there are no live clean REDO DLRs 
for the transaction. The COMMIT TLR must remain live until the UNDO DLR 
is removed, and the only way the UNDO DLR can be removed is if the COMMIT 
TLR is copied “ahead” of the UNDO DLR in the log. 

^ To say that log record L_1 precedes log record L_2 in a circularly managed log means 
that L_1 appears before L_2 logically with respect to time. 

Log records that are kept live due to explicit GRDs are referred to as live- 
garbage. This is a hybrid classification. On the one hand, the log record is 
garbage, at least with respect to the database. On the other hand, the log record 
is live, at least with respect to the log. To ensure that the log record stays in the 
log and is recycled after a system crash, it must be treated as live — and thus 
tracked in main-memory — during normal system operation. A live-garbage 
log record is thus viewed as a garbage log record needed for correct recovery. 
Live-garbage log records increase the amount of log space consumed, increase 
the amount of main-memory used to track log records, and increase the number 
of log records that need to be copied. 

Another problem with explicit GRDs is that garbage log records that are 
depended upon must be tracked in main-memory until they are removed from 
the log (i.e., overwritten with new log records or copies of old ones). We refer to 
such log records as tracked-garbage. Tracked-garbage log records are garbage 
in that they can be removed from the log at the next opportunity; however, 
unlike garbage log records, their entries remain in the main-memory log directory 
until they are removed from the log. This increases the amount of main-memory 
consumed. 

Whether a garbage log record is classified as live-garbage or tracked-garbage 
depends on whether there are any existing tracked-garbage log records on which 
it depends. If there is no existing tracked-garbage, the garbage log record is 
classified as tracked-garbage; otherwise, the garbage log record is classified as 
live-garbage. Live-garbage log records become tracked-garbage when the tracked- 
garbage on which they depend is removed. 
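
The rule in the preceding paragraph can be written down directly. The sketch below applies only to garbage log records that take part in an explicit GRD; the function name, the dictionary encoding of dependencies, and the record labels are assumptions made for this example.

```python
def classify_garbage(record, depends_on, tracked):
    """Classify a garbage log record that takes part in an explicit GRD.

    depends_on -- dict: record -> set of records that must be removed before it
    tracked    -- tracked-garbage records that are still present in the log
    """
    blockers = depends_on.get(record, set()) & tracked
    return "live-garbage" if blockers else "tracked-garbage"


# GRD3: the UNDO DLR must be removed before the COMMIT TLR of the same transaction.
depends_on = {"COMMIT(t1)": {"UNDO(o1,t1)"}}
tracked = {"UNDO(o1,t1)"}

before = classify_garbage("COMMIT(t1)", depends_on, tracked)
tracked.discard("UNDO(o1,t1)")      # the UNDO DLR is overwritten in the log
after = classify_garbage("COMMIT(t1)", depends_on, tracked)
```

The COMMIT TLR starts out as live-garbage and becomes tracked-garbage once the UNDO DLR it depends on has been removed, which is the transition described above.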

To summarize, garbage UNDO DLRs and garbage REDO DLRs (written 
by active and committed transactions) must be tracked. A DLR with the 
most recent image of an object must be removed from the log last. To keep a 
clean REDO DLR containing the most recent image of an object recyclable, 
its associated COMMIT TLR must be kept in the log as well. Finally, to 
enforce the presumed abort protocol, a COMMIT TLR must be kept in the 
log until all associated UNDO DLRs are removed. This results in extra log 
and main-memory overhead for transactions that have already completed and 
changes that have already been propagated to the database. 

Figure 1 compares the lifetimes of log records in an ordered vs. unordered log. 
Note that in an unordered log, not all log records pass through all lifetime 
phases. For example, a live log record could go directly to garbage, or a 
tracked-garbage log record could go directly to dead (for the latter, this is the 
most common case). In any case, lifetime phases must proceed from left to right 
in the diagram. 

Fig. 1. Log record lifetimes in an ordered vs. unordered log. 

3.5 Lifetimes in a Partitioned Log 

XEL introduces the notion of a partitioned log, a log divided into n ≥ 2 
generations. Live log records are copied from one generation to the next (for- 
warded) and within the last generation (recirculated). A partitioned log is like 
an unordered log in that it has the same explicit GRDs and lifetimes. An explicit 
GRD L_1 → L_2 can exist between generations G_i and G_j, i ≠ j, or within generation 
G_{n-1}. Explicit GRDs within G_{n-1} are due to G_{n-1} being unordered. Explicit 
GRDs between generations are due to garbage removal being done separately by 
generation; without explicit GRDs, garbage log records could be removed out- 
of-order. An explicit GRD ensures that a dependent log record is forwarded or 
recirculated, as necessary, until the log record upon which it depends is removed. 



3.6 Lifetimes in an XEL Log 

Log record lifetimes in XEL are as described in the previous section. In addition, 
XEL introduces — unnecessarily — extra live-garbage by tracking all GRDs as 
explicit. XEL’s behavior with respect to GRDs is the same regardless of whether 
the log is partitioned or non-partitioned (i.e., single-generation). 

To track and enforce GRDs, XEL assigns a status to each log record, which 
is stored in each log record’s main-memory directory entry. The status reflects 
whether the log record is live, live-garbage, or tracked-garbage (according to our 
classification). For example, a REDO DLR is live if its status is unflushed, live- 
garbage if its status is required, or tracked-garbage if its status is recoverable 
(a garbage REDO DLR has status non-recoverable, but that is not recorded 
since a garbage log record has no main-memory representation). The presence of 
a required REDO DLR implies the presence of a GRD1; for example, recoverable 
REDO DLR → required REDO DLR. UNDO DLRs and COMMIT TLRs also 
have statuses assigned to them, although they do not map one-to-one with our 
classification in all cases. For example, an annulled UNDO DLR is unambigu- 
ously tracked-garbage; however, a required COMMIT TLR may be live or live- 
garbage, depending on the status(es) of its associated DLR(s). A live-garbage 
required COMMIT TLR implies that a GRD2 (required REDO DLR → required 
COMMIT TLR) or GRD3 (annulled UNDO DLR → required COMMIT TLR) or 
both are present. Log records with statuses that make them live or live-garbage 
can be forwarded or recirculated. 
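
For REDO DLRs, where the statuses map one-to-one onto our classification, the correspondence can be tabulated as below. The dictionary and function names are hypothetical; the mapping itself restates the paragraph above and deliberately leaves out COMMIT TLRs, whose status does not determine the classification on its own.

```python
# Status of a REDO DLR -> classification used in this paper.
REDO_STATUS_TO_CLASS = {
    "unflushed": "live",
    "required": "live-garbage",
    "recoverable": "tracked-garbage",
    "non-recoverable": "garbage",    # never stored: garbage has no directory entry
}


def may_be_copied(classification):
    """Only live and live-garbage records are forwarded or recirculated."""
    return classification in ("live", "live-garbage")


def redo_dlr_may_be_copied(status):
    return may_be_copied(REDO_STATUS_TO_CLASS[status])
```

A `required` REDO DLR is therefore still copied from generation to generation, whereas a `recoverable` one is merely tracked in main memory until it is overwritten.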

XEL treats all GRDs as explicit, even GRDs that are otherwise implicit. This 
can lead to extra overhead due to the way the log is managed. Recall that n pages 
of free space are maintained between the tail and head of each generation |B|. If 
the log records involved in an implicit GRD both appear in the current n page 
range, the dependent log record will be treated as live-garbage and forwarded or 
recirculated unnecessarily. If the log records involved in the GRD do not appear 
in the current n page range, then no extra overhead is incurred; the log record 
depended upon will be removed before the dependent log record is considered 
for forwarding or recirculation. 

As an example, suppose we have a two-generation log and that a given 
transaction’s UNDO DLR is followed by its COMMIT TLR in G_0. This gives 
an implicit GRD3. Suppose further that both log records appear within the 
currently advanced n page window. XEL will forward the COMMIT TLR to 
G_1, not recognizing that the UNDO DLR is guaranteed to be removed first. 
The result is that a real explicit GRD3 is actually created (fortunately, the 
UNDO DLR will in all likelihood be removed before the forwarded COMMIT 
TLR becomes a candidate for recirculation). 

To summarize, GRDs ensure that the log, and ultimately the database, re- 
mains recoverable when log records get reordered. GRDs extend the log and 
main-memory lifetimes of log records. Using GRDs, we can classify log records 
into four types: live (L), live-garbage (LG), tracked-garbage (TG), and garbage 
(G). 

Figure 2 gives an example of all six GRD combinations and their effect on 
log record lifetimes in XEL. Suppose we have a two-generation XEL log. Trans- 
action t_1 updates object o_1 and REDO DLR R_{o1,t1} and UNDO DLR U_{o1,t1} are 
logged. Then, before t_1 commits and logs COMMIT TLR C_{t1}, R_{o1,t1} and U_{o1,t1} 
are forwarded to generation G_1. Suppose that transactions t_2 and t_4, running 
in succession, each update object o_4 and commit. Finally, suppose transaction 
tr updates object o_1 and then commits. If objects o_1 and o_4 are flushed to the 
database, three implicit GRDs and three explicit GRDs exist as depicted in the 
figure. Since XEL treats implicit GRDs as explicit, unnecessary LG is maintained 
(unnecessary LG is marked with ‘*’s). 

Fig. 2. An example showing all six GRD combinations (explicit GRDs are prefixed 
with an ‘e’, and implicit GRDs are prefixed with an ‘i’). 

We’ll end our discussion of log record lifetimes with a few additional com- 
ments on GRDs. One thing to note is that a GRD1 can be broken prematurely 
if a stale REDO DLR’s associated COMMIT TLR is removed first. This is be- 
cause a REDO DLR without a corresponding COMMIT TLR is G and cannot 
be recycled. Also note that there is no LG UNDO in XEL. This is because XEL 
retains a stale REDO DLR as LG even if it is superseded by a current flushed 
UNDO DLR. Although this means XEL handles UNDO and REDO dependen- 
cies asymmetrically, it has no effect on its correctness (a current UNDO contains 
the same image as the previously current REDO). Finally, note that XEL relies 
on GRDs to set DLR timestamps correctly. A given object’s timestamp can be 
reset to zero only when there are no more L, LG, or TG DLRs for the object. 

4 Simulation Environment and Methodology 

To quantify the effects of GRDs, we implemented a version of XEL based on de- 
tails found in m. We built it on top of a detailed database simulation environ- 
ment which includes object locking, page latching, LRU cache page replacement, 
and log disk simulation (including read and write buffering and group commit [Q 
for batching writes to disk). We use CSIM18 jO] as our discrete-event simulation 
engine. In our model, simulation time advances as a result of disk I/O. 

Our goal is to demonstrate how main-memory and log space allocation vary 
with changes in workload and database parameters. We run sets of experiments 
that vary transaction duration, object size, object skew, and log size. For easy 
comparison across all experiments, we vary only one parameter value at a time in 
each experiment; all remaining parameters take their default values (as specified 
in Table 1). Although the database we model is small, our results demonstrate 
the basic effects that GRDs have in XEL. 

Table 1. Default parameter values. 

Parameter             Value 
txRate                100 tx/sec 
txDuration            1 sec 
txNumUpdates          4 
txObjectSkew          0.1 
txCommitProbability   0.95 
diskPageSize          4096 bytes 
databaseObjectSize    100 bytes 
databaseSize          10000 pages 
cacheSize             500 pages 
cachePolicy           LRU 
logSize               250 pages 

For each experiment, we measure the average main-memory and log space 
allocation. We classify main-memory allocation into three categories: L, LG, 
and TG. We classify log space allocation into these same three categories plus 
an additional one, G. L, LG, and TG are measured by periodic sampling 
throughout each experiment; G is derived as the size of the log less the sum 
of the other three components. Main-memory is measured by assigning sizes to 
each data structure element that varies in number with transaction load: LOT 
entries, LTT entries, cells, and “transaction object id set elements” (we used 
the same values as in |^). We update the classification of each data structure 
element as the classifications of its related log records change. Due to specifics 
of our implementation — for example, because we use Log Sequence Numbers 
(LSNs) PH instead of object-level timestamps — our DLR header size and TLR 
size (43 bytes and 28 bytes, respectively) are bigger than in |B|. While this may 
affect the absolute values of our results, it should have little or no effect on the 
relative trends. 

For each experiment, we wait until a steady state is reached before recording 
data (most of the experiments reported here stabilize after two passes of the 
log; some, like those with longer transaction durations, take longer). We record 
data for a minimum of ten passes over the log. For experiments running in a 
partitioned log, the minimum pass requirement is met by all generations. 

5 Experimental Results 

We perform four sets of experiments using a non-partitioned log (referred to as 
XEL-1) where we vary transaction duration, object size, object skew, and log 
size. We perform two sets of experiments using a two-generation partitioned log 
(referred to as XEL-2) where we vary transaction duration and object size. For 
the XEL-2 experiments, the log is partitioned such that G_0 has 20% of the total 
log space. 

5.1 XEL-1 Results 

Figures 3(a) and 3(b) show the results for XEL-1 when transaction duration d 
is varied at 1, 3, 5, 7, and 9 seconds. As transaction duration increases, the L 
components of both main-memory and log space increase. This is due primarily 
to the increased number and lifetime of UNDO DLRs. The “spike” in main- 
memory at d = 1 in Fig. 3(a) is due indirectly to the disproportionately low 
number of UNDO DLRs in the log for this case relative to the others. At d = 1, 
there is a relatively low probability that a dirty page will be flushed to the 
database; hence, a relatively small number of UNDO DLRs are logged. At d = 3, 
it turns out that the rate at which dirty pages are flushed increases substantially 
when compared to d = 1. Beyond d = 3, the dirty page flush rate increases, but 
not as rapidly (in fact, there is a bit of a jump at d = 5, but beyond that it 
starts to level out). 

To understand how the amount of UNDO information in the log affects main- 
memory requirements, consider as an example two extremes: either no UNDO 
DLRs are written or an UNDO DLR is written for each REDO DLR. In the first 
case, the number of objects represented in the log is effectively double that of the 
second (ignoring duplicate updates). Consequently, the number of transactions 
represented in the log is also double. This effectively doubles the amount of 
main-memory overhead for the object and transaction related data structures. 
This phenomenon explains the higher amount of TG main-memory at d = 1. 

An additional phenomenon takes effect starting at the d = 3 second case. 
Because of the jump in UNDO information in the log, there is a corresponding 
jump in LG main-memory for the transaction related data structures. This is 
due to the increase in the number of COMMIT TLRs that must be retained as 
LG. 

Figures 4(a) and 4(b) show the results for XEL-1 when object size o is varied 
at 50, 100, 150, 200, and 250 bytes. L main-memory usage is flat since the 
main-memory representation of an object is independent of its size (there are the 
same number of L DLRs in the log for each experiment). The most interesting 
result is that total main-memory usage decreases as object size increases. The 
shape of the LG and TG curves in Fig. 4(a) is explained by observing that the 
total number of objects represented in main-memory is inversely proportional 
to object size. As objects get bigger, fewer DLRs fit in the log. If we ignore the 
overhead of TLRs, we can approximate the number of DLRs in the log as 



$$\mathit{numDLRs} = \frac{\mathit{logSize}}{\mathit{objectSize} + \mathit{dlrHeaderSize}} \qquad (1)$$



Total main-memory overhead is proportional to numDLRs. 
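
As a rough illustration of Eq. (1), the following sketch evaluates the approximation for the object sizes used in the experiments. The page size, the log size in bytes, and the DLR header size are hypothetical values chosen only for illustration; they are not taken from the experiments.

```python
def num_dlrs(log_size_bytes: int, object_size: int, dlr_header_size: int) -> int:
    """Approximate number of DLRs that fit in the log, ignoring TLR overhead (Eq. 1)."""
    return log_size_bytes // (object_size + dlr_header_size)

# Hypothetical values, not taken from the experiments.
PAGE_SIZE = 4096          # bytes per log page (assumed)
LOG_PAGES = 250           # smallest log size varied in the experiments
DLR_HEADER_SIZE = 16      # bytes of per-record header (assumed)

log_size = LOG_PAGES * PAGE_SIZE
counts = {o: num_dlrs(log_size, o, DLR_HEADER_SIZE) for o in (50, 100, 150, 200, 250)}
for size, n in counts.items():
    print(f"object size {size:3d} bytes -> about {n} DLRs")
```

Under these assumptions the count falls steadily as objects grow, which is the inverse relationship the curves reflect.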

Figures 5(a) and 5(b) show the results for XEL-1 when object access skew s 
is varied at 0.5, 0.4, 0.3, 0.2, 0.1 (here s = 0.5 means uniform 
access to database objects, whereas s = 0.1 is a highly skewed access pattern in which 
90% of the updates go to 10% of the database). With increased skew, there is 
increased locality of access. This means that pages will be flushed less frequently 
from the cache and thus REDO DLRs will remain L a little longer. In addition, 
since more objects get repeat updates, there is increased LG due to increased 
GRDs. These effects start to magnify at the higher skews of 0.2 and 0.1. 

Figures 6(a) and 6(b) show the results for XEL-1 when log size l is varied 
at 250, 300, 350, 400, and 450 pages. L main-memory and log space usage is flat 
since the same workload is run for all experiments. However, since more objects 
fit into more log space, there is more LG and TG main-memory and log space 
overhead. 

5.2 XEL-2 Results 

Figures 3(c) and 3(d) show the results for XEL-2 when transaction duration d 
is varied at 1, 3, 5, 7, and 9 seconds. The L components are the same as in the 
XEL-1 experiments since the same parameters are used. However, partitioning 
the log changes the way in which the LG and TG components react. 

The first thing to notice is that TG main-memory increases as we go from 
d = 1 to d = 3. This is the opposite of what happens for XEL-1, although it 
happens for the same underlying reason — a low number of UNDO DLRs at 
d = 1. With little UNDO, there are many more transactions with just REDO in 
the log. However, since a pass over $G_0$ is much faster in XEL-2 than in XEL-1, 
COMMIT TLRs are removed faster and hence more REDO DLRs become G 
sooner. The larger G component at d = 1 displaces what would otherwise be 
TG. 

Another interesting result is the jump in LG main-memory at d = 5. This is 
due to many more UNDO DLRs being forwarded to $G_1$ than for d = 3. This is 
because at d = 5 (and beyond), transaction duration exceeds the average time 
it takes to complete a pass over $G_0$. As a result, the number of COMMIT TLRs 
that must be retained as LG (and subsequently forwarded) increases. Moreover, 
since it takes about ten times as long to complete a pass over $G_1$ as it does to 
complete a pass over $G_0$, the lifetimes of TG UNDO DLRs, and hence of their 
corresponding LG COMMIT TLRs, are greatly extended. 

Interestingly enough, overall main-memory for the d = 5 case decreases. This 
is because the increase in LG is more than offset by a decrease in TG. First of 
all, the increase in LG for the transaction related main-memory data structures 
is accompanied by a corresponding decrease in TG for those same structures. 
In addition, TG is reduced further because the dirty page flush rate increases 
enough at d = 5 so that an increased amount of UNDO in the log displaces 
a corresponding amount of REDO (as explained earlier for XEL-1). Beyond 
d = 5, main-memory continues to decrease slightly. This is because pass time 
over $G_1$ decreases, resulting in quicker removal of TG UNDO DLRs and hence 
their corresponding LG COMMIT TLRs. 

Figures 4(c) and 4(d) show the results for XEL-2 when object size o is varied 
at 50, 100, 150, 200, and 250 bytes. Since the same parameters are used as for 
XEL-1, the L components are the same as in XEL-1. As in XEL-1, the shape of 
the main-memory curve in Fig. 4(c) is due to the inverse relationship between 
main-memory and object size. One notable difference is the increased LG for 
XEL-2. LG is relatively large at o = 50, drops off abruptly at o = 100, and 
then drops off more smoothly beyond that. This is because logging bigger DLRs 
consumes the log faster (since the same number of DLRs are logged for each 
experiment). The result is that pass time, and in particular $G_1$ pass time, varies 
inversely with the size of objects. Cycling through $G_1$ when o = 50 takes about 
five times as long as it does when o = 100 (cycling through $G_1$ when o = 100 
takes only about twice as long as it does when o = 150). For o = 50, the 
considerably longer lifetime of TG in $G_1$ results in the extra LG reported. 



[Plots: XEL-1 log allocation (bytes) and XEL-1 main memory allocation (bytes).] 

Fig. 3. Main-memory and log space allocation for increasing transaction duration. 

Fig. 4. Main-memory and log space allocation for increasing object size. 

Fig. 6. Main-memory and log space allocation for increasing log size. 







5.3 Discussion 

One consequence of a common set of parameters is that log size cannot be 
“tuned” ad hoc to the minimum required to support the workload of each 
experiment (as has been done in earlier studies). This means that the amounts of TG (and to a 
lesser extent LG) reported are higher than they would be if the size of the log 
were minimized. On the other hand, it is unlikely that a log used within the 
context of a real database system would (or could) be tuned in this manner. 
More likely, log size would be fixed and would incorporate extra space to 
accommodate peaks in workload. In addition, there will be times when the amount of 
L in the log is low and thus the amount of TG and LG in the log is high (such 
as when the workload quiesces temporarily). Also, in the interest of choosing 
generation sizes such that ‘only a small fraction of log records need to be 
forwarded or recirculated’, log space utilization would need to be kept low. 
This can be accomplished by increasing the size of the log beyond the workload’s 
capacity needs. Therefore, we expect that in a real system a nontrivial amount 
of main-memory will always be dedicated to tracking obsolete recovery data. 



6 Summary and Future Work 

We analyzed XEL qualitatively and quantitatively to show how main-memory 
and log space allocation are affected by copying log records within the log. 
When the log becomes unordered, dependencies among log records must be 
tracked explicitly to constrain the order in which log records are overwritten. 
This results in log records that remain live past their ideal logical lifetimes, thus 
increasing main-memory and log space usage. In addition, we’ve shown that 
overall main-memory usage increases as the log gets bigger or objects get smaller. 

It would be interesting to repeat our experiments for Ephemeral Logging 
(EL). In EL, database objects contain timestamps, and it has been stated 
that this would result in less main-memory and log space usage. Using our 
analysis (for example, noting that only GRD3s would exist in EL), this 
difference in main-memory and log space usage could be quantified. Also, 
it would be worthwhile to investigate how changes in workload parameters, 
database parameters, and log partition affect transaction throughput (in both 
XEL and EL). 

Abstract. As XML is increasingly being used in Web applications, new 
technologies need to be investigated for processing XML documents with high 
performance. Parallelism is a promising solution for structured document 
processing and data placement is a major factor for system performance 
improvement in parallel processing. This paper describes an effective XML 
document data placement strategy. The new strategy is based on a multilevel 
graph partitioning algorithm with the consideration of the unique features of 
XML documents and query distributions. A new algorithm, which is based on 
XML query schemas to derive the weighted graph from the labelled directed 
graph presentation of XML documents, is also proposed. Performance analysis 
on the algorithm presented in the paper shows that the new data placement 
strategy exhibits low workload skew and a high degree of parallelism. 

Keywords: Data Placement, XML Documents, Graph Partitioning, and Parallel 
Data Processing. 



1 Introduction 

As a new markup language for structured documentation, XML (eXtensible Markup 
Language) is increasingly being used in Web applications because of its unique 
features in data representation and exchange. The main advantage of XML is that 
each XML file can have a semantic schema, which makes it possible to define much 
more meaningful queries than simple, keyword-based retrievals. A recent survey 
shows that the number of XML business vocabularies has increased from 124 to over 
250 in six months [1]. It can be expected that data in XML format will be widely 
available throughout the Web in the near future. As Web applications are 
time-critical, the increasing size of XML documents and the complexity of evaluating 
XML queries pose new performance challenges to existing information retrieval 
technologies. The use of parallelism has shown good scalability in traditional 
database applications and provides an attractive solution to process structured 
documents [2]. A large number of XML documents can be distributed onto several 
processing nodes so that a reasonable query response time can be achieved by 
processing the related data in parallel. 

In parallel data processing, effective data placement has drawn a lot of attention 
because it has a significant impact on the overall system performance. The data 
placement strategy for parallel systems is concerned with the distribution of data 
between different nodes in the system. A poor strategy can result in a non-uniform 
distribution of the load and the formation of bottlenecks [3]. In general, determining 
the optimal placement of data across nodes for performance is a difficult problem 
even for the relational data model [4]. XML documents introduce additional 
complexity because they do not have a rigid, regular, and complete structure. 
Although some XML documents may have a DTD (Document Type Definition) file 
to specify their structures and the W3C (World Wide Web Consortium) is working on 
the XML Schema standard, either DTD or XML Schema is an optional companion to 
the XML documents. We cannot expect that every XML document on the Web is a 
valid XML file, which means that it conforms to a particular DTD or XML Schema. 

In this paper, we use the labelled directed graph model to represent XML data. A 
graph partition algorithm is explored to maximise the parallelism among the different 
processing nodes in a shared-nothing architecture where each node has its own 
memory and disks. The distribution of the data is dependent on the queries applied to 
the data. XML queries are based on path expressions because XML data may lack schema 
information. As path expressions access data in a navigational manner, elements 
along the target path should be placed together to minimise communication cost. 
At the same time, data relevant to the same query should be distributed evenly across 
different nodes to achieve load balance. These two objectives are both considered 
in the new proposed data placement strategy. Moreover, the new strategy is based on 
the unique features of XML documents and the distribution of XML query sets. This 
paper also presents the performance analysis on the new data placement strategy. 

The remainder of the paper is organised as follows: Section 2 presents the related 
work and motivations of the study. Section 3 describes the XML data model and the 
algorithm for deriving the weighted graph of XML documents. Section 4 proposes a 
new graph-partitioning algorithm based on the features of XML documents. Section 5 
analyses the performance of the new algorithm. Section 6 concludes the paper and 
discusses pending research issues. 



2 Related Work and Motivations 

Effective parallelisation of data queries requires a declustering of data across many 
disks so that parallel disk I/O can be obtained to reduce response time. A poor data 
distribution can lead to a higher workload, load imbalance and hence higher cost. 

Various data placement strategies have been developed by researchers to exploit the 
performance potential of shared-nothing relational database systems. Since the 
complexity of the problem is NP-complete [5], heuristics are normally used to find a 
nearly optimal solution in a reasonable amount of time. According to the criteria used 
in reducing costs incurred on resources such as network bandwidth, CPUs, and disks, 
data placement strategies can be classified into three categories, which are network 
traffic based [6], size based [7], and access frequency based [8]. The main idea of 
these approaches is to achieve the minimal load (e.g. network traffic) or a balance of 
load (e.g. size, I/O access) across the system using a greedy algorithm. Our algorithm 
is a combination of the network traffic based and access frequency based strategy, 
because it aims to minimise the communication cost and to maximise the intra- 
operation parallelism. 

In parallel object-oriented database systems, the data placement strategy is also critical to 
system performance and is far more complex. [9] pointed out that in designing a 
data placement method for a parallel object-oriented database, two major factors that 
often contradict each other must be taken into account: minimising 
communication cost and maintaining load balance. [4] used a greedy similarity graph 
partitioning algorithm to assign objects to different processing nodes, aiming to 
minimise inter-node traversals and maximise parallelism. This algorithm attempts to 
place objects that have a higher degree of similarity on different disks, where two 
objects are more similar if they are accessed together in a navigational manner but 
less similar if the two objects can be accessed together in a parallel manner. Although 
the paper gives an equation to compute the similarity between two nodes, it gives no 
definite method for obtaining the weights between two nodes. 

Data placement strategies in both relational and object-oriented parallel database 
systems could be helpful to the study of the data placement strategy for XML 
documents. The idea of our data placement strategy for XML data is similar to those 
in parallel object-oriented databases. But we focus on how to construct the weighted 
graph from the original XML document, which forms the basis of the graph 
partitioning algorithm. The objective of the research is to find a nearly optimal 
data distribution so that the system throughput and resource utilisation can be 
maximised. Our graph partition algorithm is based on the multilevel graph partition 
algorithm for its efficiency and accuracy. The unique features of XML documents and 
XML queries have been studied to provide the foundation for the graph partition. 



3 Graph Model of XML Data 



3.1 Labelled Directed Graph 

The latest W3C working draft on XML Information Set (InfoSet) [10] provides a data 
model for describing the logical structure of a well-formed XML 1.0 document. In 
this model, an XML document's information set consists of a number of information 
items, which are abstract representations of some components of an XML document. 
For example, in the XML document of Figure 1, there are three different types of 
information item: document information items, element information items, and 
attribute information items. The specification presents the information set as a tree 
and accordingly the information items as the nodes of the tree. Any information item in 
the XML document can be reached by recursively following the properties of the root 
information item. Similar to the data model used in Lore [11], we extended the 
InfoSet data model to a directed labelled graph, where the vertices in the graph 
represent the information items and arcs represent the semantic links between the 
information items. 

<Publications>
  <Proceeding>
    <Conference>VLDB</Conference>
    <Year>1999</Year>
    <Location>Edinburgh</Location>
    <Article id="A1" reference="A2 A3">
      <Title>Query Optimization for XML</Title>
      <Author>Jason McHugh</Author>
      <Author>Jennifer Widom</Author>
    </Article>
  </Proceeding>
  <Proceeding>
    <Conference>ICDT</Conference>
    <Year>1997</Year>
    <Article id="A2">
      <Title>Querying Semi-Structured Data</Title>
      <Author>Serge Abiteboul</Author>
    </Article>
  </Proceeding>
  <Proceeding>
    <Conference>ICDE</Conference>
    <Year>1998</Year>
    <Article id="A3" reference="A2">
      <Title>Optimizing Regular Path Expressions Using Graph Schemas</Title>
      <Author>Mary F. Fernandez</Author>
      <Author>Dan Suciu</Author>
    </Article>
  </Proceeding>
</Publications>



Fig. 1. An example XML document 



Figure 2 shows the graph representation of the XML document in Figure 1. We use 
the definition in [12] as our definition for the labelled directed graph: 

Definition 3.1 Let L be an arbitrary set of labels. A tuple $G = (V, A, s, t, l)$ is an 
L-labelled directed graph if V is a set of vertices, A is a set of arcs, s and t are total 
functions from A to V assigning each arc its source and target vertex, and l is a total 
label function from A to L assigning each arc a label. 

We can see that the labelled directed graph of a single XML document is actually a 
graph with a unique root. Any vertex in the graph can be reached from the root by 
following a certain path. 

Fig. 2. The labelled directed graph representation for the XML document in Figure 1. 

Definition 3.2 A nonempty sequence $(v_{i0}, a_{i1}, v_{i1}, \ldots, a_{im}, v_{im})$ is called a path in the 
graph $G = (V, A, s, t, l)$ if $s(a_{ij}) = v_{i(j-1)}$ and $t(a_{ij}) = v_{ij}$ for all positive $j \le m$, 
and all arcs and all vertices in that sequence are pairwise distinct. 



3.2 Weighted Graph 

Query languages for XML documents generally utilise path expressions to exploit the 
information stored in XML documents. Path expressions are algebraic representations 
of sets of paths in a graph and are specified by a sequence of nested tags. For 
example, the path expression "proceeding.article.title" for the XML document in 
Figure 1 refers to the titles of all articles published in all proceedings. As shown in 
[12], an XML query can also be represented by a labelled directed graph. Two XML 
queries and their graph presentations are shown in Figure 3. The elements in the 
graph are labelled with predicates, where the predicate true() serves as a wildcard. 
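
The evaluation of a simple dotted path expression over a labelled directed graph can be sketched as follows. This is a minimal illustration, not the query processor of any particular system: the class `LabelledGraph`, the function `evaluate_path`, the vertex identifiers, and the case-insensitive label matching are all assumptions made for the example.

```python
from collections import defaultdict

class LabelledGraph:
    """Labelled directed graph: each arc carries a label (hypothetical helper)."""
    def __init__(self):
        self.out = defaultdict(list)   # source vertex -> [(label, target vertex)]

    def add_arc(self, source, label, target):
        self.out[source].append((label, target))

def evaluate_path(graph, start, path_expression):
    """Return the vertices reached from `start` by following the dotted labels in order."""
    frontier = [start]
    for label in path_expression.split("."):
        # Assumption: labels are compared case-insensitively.
        frontier = [target
                    for vertex in frontier
                    for (arc_label, target) in graph.out[vertex]
                    if arc_label.lower() == label.lower()]
    return frontier

# A fragment of the document in Figure 1, with made-up vertex identifiers.
g = LabelledGraph()
g.add_arc("root", "Proceeding", "p1")
g.add_arc("root", "Proceeding", "p2")
g.add_arc("p1", "Article", "a1")
g.add_arc("p2", "Article", "a2")
g.add_arc("a1", "Title", "Query Optimization for XML")
g.add_arc("a1", "Author", "Jason McHugh")
g.add_arc("a2", "Title", "Querying Semi-Structured Data")

titles = evaluate_path(g, "root", "proceeding.article.title")
print(titles)
```

Each step of the expression narrows the frontier to the targets of arcs with the requested label, which is the navigational access pattern that motivates placing the vertices along a path together.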

Definition 3.3 Given a set of unary predicates P, a tuple $G_q = (V_q, A_q, s_q, t_q, l_q)$ is 
a query schema if the elements are labelled with predicates ($l_q : V_q \cup A_q \to P$). 

The graphs in Figure 3 can act as schemas, which partly describe the structure of the 
XML document. If the predicate in a schema is true for the corresponding vertices and 
arcs in an instance, we say that the instance conforms to the schema. The answer to a 
query of XML documents is the union of all instances conforming to the query 
schema. If those instances can be evenly distributed among several different disks 
and therefore accessed in parallel during query processing, the response 
time for a query is greatly shortened. Meanwhile, a single instance should avoid 
spanning multiple partitions to reduce the communication cost. These two objectives 
conflict because the first one tries to distribute vertices across as many partitions as 
possible, while the second one tries to group the relevant vertices together. 



Q1: select all papers authored by 'Jennifer Widom'. 
Q2: select all papers that cite the papers authored by 'Jennifer Widom'. 

Fig. 3. Two typical XML queries and their corresponding graph-based presentation 



The data placement of XML documents over different sites can be viewed as a graph 
partitioning problem. Each edge between two vertices in the graph is associated with 
a weight that describes the frequency of traversals on it. The higher the weight, the 
more likely the two vertices are to be assigned to the same partition. In our algorithm, 
the weight between two vertices reflects two factors. One is the likelihood that the two 
vertices are accessed together in a sequential manner, and the other is the 
likelihood that the two vertices are accessed together in a parallel manner. 

With the knowledge on the distribution of the XML query set, a weighted graph could 
be derived based on the labelled directed graph defined in section 3.1. 

Definition 3.4 $G^w = (V, E, r, w)$ is the weighted graph for a labelled directed 
graph $G = (V, A, s, t, l)$ if E is a set of edges, r is a total function from E to V 
assigning each edge its vertices, and w is a total weight function assigning each edge 
in E a number to describe the traversal frequency of that edge. For each $e \in E$, there 
exists at least one arc $a \in A$ with the same vertices as e. 



Algorithm 3.1 describes the method to derive the weighted graph from the original 
labelled directed graph based on the query distributions. In this algorithm, if the 
arcs in the labelled directed graph are traversed in a query, the weights between the 
corresponding vertices are computed based on the query frequency. If there is more 
than one instance that conforms to a query schema, the arcs between any two 
instances are examined to compute the weight of the edges that connect these two 
instances. The value of the parameter $\alpha$ is an adjustable number between zero and one, 
which indicates the relative benefit of increasing the degree of parallelism compared 
with lowering communication cost. If the communication overhead is high, a higher 
value for $\alpha$ can be chosen. 

Algorithm 3.1 Assigning Weight Algorithm 

Input: 
    The labelled directed graph $G = (V, A, s, t, l)$ 
    The graph presentation of the query set $Q = \{G_1^q, G_2^q, \ldots, G_n^q\}$, 
        where $G_i^q = (V_i^q, A_i^q, s_i^q, t_i^q, l_i^q)$, $i = 1, 2, \ldots, n$ 
    Query distribution: $F = \{f_1, f_2, \ldots, f_n\}$, where $\sum_{i=1}^{n} f_i = 1$ 
    Adjustable parameter $\alpha$, $0 < \alpha < 1$ 

Output: the weighted graph $G^w = (V, E, r, w)$ 

Begin 
    Initialise the weight for each edge with zero 
    For each query $G_i^q$ in the query set 
        For each $G_{ij} = (V_{ij}, A_{ij}, s, t, l)$, $j = 1, 2, \ldots, m_i$, that conforms to $G_i^q$ 
            For each $a \in A_{ij}$ 
                If $s(a) > t(a)$ Then $u = t(a)$, $v = s(a)$ 
                Else $u = s(a)$, $v = t(a)$ 
                End If 
                $w(u, v) = w(u, v) + f_i \times 100$ 
            End For 
            For each $G_{ik} = (V_{ik}, A_{ik}, s, t, l)$, $j < k \le m_i$ 
                If there exists $a \in A$ with $[s(a) \in V_{ij}, t(a) \in V_{ik}]$ or $[t(a) \in V_{ij}, s(a) \in V_{ik}]$ Then 
                    If $s(a) > t(a)$ Then $u = t(a)$, $v = s(a)$ 
                    Else $u = s(a)$, $v = t(a)$ 
                    End If 
                    $w(u, v) = w(u, v) - (1 - \alpha) \times f_i \times 100$ 
                End If 
            End For 
        End For 
    End For 
End 
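
The weight assignment can be sketched in Python as below. This is a hedged reading of Algorithm 3.1 rather than a reference implementation: it assumes that arcs traversed inside an instance add the query frequency times 100 to the edge weight, and that arcs linking two instances of the same query subtract (1 - alpha) times the frequency times 100. The function names, the representation of an instance as a set of arcs, and the toy data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def edge_key(arc):
    """Order the endpoints so that both arc directions map to one undirected edge."""
    s, t = arc
    return (t, s) if s > t else (s, t)

def assign_weights(arcs, queries, alpha):
    """arcs: (source, target) pairs; queries: (frequency, instances) pairs,
    where each instance is the set of arcs of one subgraph matching the query."""
    w = defaultdict(float)
    for freq, instances in queries:
        for inst in instances:                      # arcs traversed by the query
            for arc in inst:
                w[edge_key(arc)] += freq * 100
        vsets = [{v for arc in inst for v in arc} for inst in instances]
        for j, k in combinations(range(len(vsets)), 2):
            for s, t in arcs:                       # arcs linking two instances
                if (s in vsets[j] and t in vsets[k]) or (t in vsets[j] and s in vsets[k]):
                    w[edge_key((s, t))] -= (1 - alpha) * freq * 100
    return dict(w)

# Toy graph: 1 = root, 2 and 3 = articles, 4 and 5 = authors, arc (3, 2) = citation.
arcs = [(1, 2), (1, 3), (2, 4), (3, 5), (3, 2)]
queries = [(0.5, [{(2, 4)}, {(3, 5)}])]             # one query, two instances
weights = assign_weights(arcs, queries, alpha=0.5)
print(weights)
```

In this toy run the two article-author edges gain weight, while the citation edge joining the two instances is discounted, so a partitioner is nudged to keep each instance together and to place the two instances apart.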



4 Graph Partitioning Algorithm 

The graph partitioning problem is NP-complete [13], and heuristics are required to 
obtain reasonably good partitions. The problem is to decluster the graph into n 
partitions, such that each partition has a roughly equal number of vertices and the 
number of traversals between different partitions is minimised. In the case of XML 
parallel processing, we aim at achieving lowest communication cost and gaining load 
balance among different processing nodes. 

[13] introduced a multilevel graph partitioning algorithm, which generally consists of 
three phases: a coarsening phase, a partitioning phase, and an uncoarsening phase. The 
graph is first coarsened down to a few hundred vertices, a bisection of this much 
smaller graph is computed, and then this partition is projected back towards the 
original graph. This algorithm is suitable for XML graph partitioning because vertices 
that are accessed in a navigational manner can be coalesced first to make sure that they are 
assigned to the same processing node. Experiments presented in [13] also showed 
that the multilevel algorithm outperforms other approaches both in computation cost 
and partition quality. Our new data placement strategy is based on a multilevel graph 
partitioning approach with the consideration of features of XML documents and XML 
query distributions. 

The goal of the coarsening phase is to reduce the size of a graph by computing a 
matching, a set of edges no two of which share a vertex, and collapsing the matched 
vertices together. The edges in this set are removed, and the two vertices 
connected by an edge in the matching are collapsed into a single vertex whose weight 
is the sum of the weights of the component vertices. The method used to compute the 
matching is crucial, because it will affect both the quality of the partition, and the time 
required during the uncoarsening phase. [13] described a heuristic known as heavy- 
edge matching (HEM) which tries to find a maximal matching that contains edges 
with large weight. The idea is to randomly pick an unmatched node, select the edge 
with the highest weight over all valid incident edges, and mark both vertices 
connected by this edge as matched. Because it collapses the heaviest edges, the 
resulting coarse graph is loosely connected. Therefore, the algorithm can produce a 
good partition of the original graph. 

[14] argued that the HEM algorithm may miss some heavy edges in the graph because 
the nodes are visited randomly. To overcome this problem, they proposed a heaviest- 
edge matching by sorting the edges by their weights and visiting them in decreasing 
order of weight. HEM and its variants reduce the number of nodes in a graph by 
roughly a factor of 2 at each stage of coarsening. If r (instead of 2) nodes of the graph 
are coalesced into one at each coarsening step, the total number of steps can be 
reduced from $\log_2(n/k)$ to $\log_r(n/k)$. [14] used an algorithm called heavy-triangle 
matching (HTM), which coalesces three nodes at a time and thereby achieves a 20% time 
saving. 
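
The sorted variant of heavy-edge matching described above, followed by one coarsening step, can be sketched as follows. The dictionary-based edge representation and the function names are assumptions made for illustration, and vertex weights are left out for brevity.

```python
def heaviest_edge_matching(weights):
    """weights: dict mapping (u, v) -> edge weight. Visit edges in decreasing
    order of weight and keep an edge if both endpoints are still unmatched."""
    matched, matching = set(), []
    for (u, v), _ in sorted(weights.items(), key=lambda item: item[1], reverse=True):
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching

def coarsen(weights, matching):
    """Collapse each matched pair into one multivertex and sum parallel edge weights."""
    owner = {}
    for u, v in matching:
        owner[u] = owner[v] = (u, v)
    coarse = {}
    for (u, v), w in weights.items():
        cu, cv = owner.get(u, (u,)), owner.get(v, (v,))
        if cu == cv:
            continue                      # edge lies inside a multivertex
        key = tuple(sorted((cu, cv)))
        coarse[key] = coarse.get(key, 0) + w
    return coarse

edges = {("a", "b"): 5, ("b", "c"): 9, ("c", "d"): 4, ("a", "d"): 1, ("a", "c"): 2}
matching = heaviest_edge_matching(edges)
coarse_edges = coarsen(edges, matching)
print(matching, coarse_edges)
```

Because the heaviest edge (b, c) is collapsed first, its weight disappears from the coarse graph, and the three remaining cross edges merge into a single edge between the two multivertices.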

We call our coarsening algorithm HSM (Heaviest Schema Matching). Algorithm 4.1 
describes the details of the algorithm. In HSM, the vertices are no longer visited in 
random order. The edges are sorted by their weight and the vertices with the 
maximum weighted edge are selected to do the matching first. According to Algorithm 
3.1, there is an edge between two vertices in the weighted graph only if there is an arc 
between them in the labelled directed graph. In other words, a neighbour of a 
vertex v in the weighted graph can be accessed together with v by following a 
certain path. It is therefore reasonable to collapse a matched vertex together with as 
many of its neighbours as possible if the weight between them is high enough. This 
strategy can improve the efficiency of the coarsening phase. 

The coarsening phase stops when the number of nodes in the coarser graph is small 
enough. The coarsened graph $G_i$ is made of multivertices and edges that have not 
been merged. A vertex in graph $G_i$ is called a multivertex if it contains more than one 
vertex of G. In a weighted graph, the weight of each edge indicates the possibility 
for the corresponding vertices to be accessed together both in sequential mode and in 
parallel mode. On the other hand, the weight also reflects the workload under the 
query distribution. When two vertices are collapsed together, we need to keep the 
weight information of the edge being merged. Therefore, we introduce a new 
notation vp to denote the weight of the multivertices in the coarsened graph. The 
workload of each partition is determined by the sum of the weights of the edges and 
multivertices in that partition. 

Definition 3.5 $G_i = (V_i, E_i, r, vp)$ is the coarsened graph of a weighted 
graph $G^w = (V, E, r, w)$ if $V_i$ is made up of multivertices 
that are created by collapsing vertices from V, and vp is a total weight function 
assigning each multivertex in $V_i$ a number to describe the workload of that vertex. 

Algorithm 4.1 Coarsening Graph 

Input: The labelled directed graph $G = (V_0, A, s, t, l)$ 
       The weighted graph $G_0 = (V_0, E_0, r, w)$ for G 
Output: Coarser graph $G_n = (V_n, E_n, w)$ with N vertices 

Begin 
    i = 0 
    Do while the number of vertices in $G_i = (V_i, E_i, r, vp)$ is greater than N 
        Sort the edges of $E_i$ in descending order by their weights 
        Assume $(u, v) \in E_i$ is one of the edges with maximum weight 
        Call collapse_vertices(u, v) to get the new vertex $v' \in V_{i+1}$ 
        i = i + 1 
    End Do 
End 

collapse_vertices(u, v) { 
    $V_{i+1} = V_i - \{u, v\}$ 
    $E_{i+1} = E_i - (u, v)$ 
    Build a new vertex $v'$ 
    Compute the workload for the new vertex: 
        $vp(v') = vp(v') + w(u, v)$ 
    For each neighbour $x \in V_i$ of u and v 
        If $(u, x) \in E_i$ and $(v, x) \in E_i$ Then 
            $E_{i+1} = E_i - \{(u, x), (v, x)\} + (x, v')$ 
            $w(x, v') = w(u, x) + w(v, x)$ 
        Else 
            $E_{i+1} = E_i - (\{u \mid v\}, x) + (x, v')$ 
            $w(x, v') = w(\{u \mid v\}, x)$ 
        End If 
        If $w(u, x) \ge w(u, v)$ Then 
            Call collapse_vertices(u, v) to get the new vertex $v' \in V_{i+1}$ 
        End If 
    End For 
} 



The second phase of a multilevel graph partitioning algorithm is to compute a 
balanced bisection of the coarsened graph. [13] evaluated four different algorithms for 
partitioning the coarser graph. The basic idea of those algorithms is to form a cluster 
of highly connected nodes. We choose the graph growing heuristic for the 
partitioning phase. The heuristic computes a partition by recursively bisecting the 
graph into two sub-graphs of appropriate weight. To bisect a graph, we first pick the 
multivertex with the highest weight, then add its neighbours and its neighbours' 
neighbours in a heaviest-edge-first manner until the workload of the new partition 
reaches the average workload of the graph. 

Algorithm 4.2 Partitioning Graph

Input:  Coarser graph G_n = (V_n, E_n, w) with vertices
        The number of processors: m
Output: The partition function P assigning each vertex v ∈ V_n to
        one of m partitions

Begin
    i = 2
    workload(G_n) = Σ_{e ∈ E_n} w(e) + Σ_{v ∈ V_n} w(v)
    Do while i ≤ m
        Average_workload = workload(G_n) / i
        For j = 1 to i/2 do
            Sort the multivertices in G_j in descending order by their weights
            Assume v is one of the multivertices with maximum weight
            workload(G_j’) = w(v)
            V_j’ = {v}
            Do while (Σ_{e ∈ E_j’} w(e) + Σ_{v ∈ V_j’} w(v)) < Average_workload
                Assume u is the neighbour of V_j’ with the maximum edge weight
                V_j’ = V_j’ ∪ {u}
            End Do
        End For
        i = i × 2
    End Do
End
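
The inner loop of the graph growing heuristic can be illustrated with a short Python sketch. It is a simplified reading of the listing, not the authors' code: `grow_partition`, `weights`, `workload` and `target` are hypothetical names, the graph is undirected, and the workload of a part is assumed to be the sum of its vertex weights plus the weights of the edges inside it.

```python
def grow_partition(weights, workload, target):
    """Grow one part from the heaviest vertex until its workload reaches target.

    weights  maps frozenset({a, b}) -> edge weight
    workload maps vertex -> vertex weight
    """
    seed = max(workload, key=workload.get)
    part = {seed}
    load = workload[seed]
    while load < target:
        # Candidate edges have exactly one endpoint inside the part.
        frontier = {e: w for e, w in weights.items() if len(e & part) == 1}
        if not frontier:
            break  # nothing left to absorb
        edge = max(frontier, key=frontier.get)  # heaviest edge first
        (u,) = edge - part
        part.add(u)
        # Add the vertex weight and every edge that is now internal because of u.
        load += workload[u] + sum(
            w for e, w in weights.items() if u in e and e <= part
        )
    return part, load
```

Calling the function with `target` set to half of the total workload yields one side of a bisection; the remaining vertices form the other side.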



During the third phase of multilevel graph partitioning, the partition of the
coarsest graph G_k is projected back to the original graph by going through the
graphs G_{k-1}, G_{k-2}, ..., G_1. The purpose of a partition refinement
algorithm is to select two subsets of vertices, one from each part, such that
swapping them yields a partition with a smaller edge-cut. Many algorithms
associate with each vertex v a quantity called gain, which is the decrease in
the edge-cut if v is moved to the other part. These algorithms proceed by
repeatedly selecting vertices with the highest gains from each part and updating
the gains of the remaining vertices. Assuming P is the initial partition of the
graph, the gain of a vertex v is defined as follows:



$$ g_v = \sum_{\substack{(v,u) \in E \\ P(v) \neq P(u)}} w(v,u) \;-\; \sum_{\substack{(v,u) \in E \\ P(v) = P(u)}} w(v,u) \qquad (1) $$



If v is moved to the other partition, the gains of its neighbours must be updated.
The algorithm stops when no vertex with a positive gain remains.
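
To make the gain definition concrete, here is a small Python sketch. The function name `gain` and the dictionary layout are assumptions made for the example: edges are undirected and keyed by `frozenset`, and `part` maps each vertex to its partition number.

```python
def gain(v, weights, part):
    """Decrease in edge-cut if v moves to the other part (formula 1).

    weights maps frozenset({a, b}) -> edge weight
    part    maps vertex -> partition number
    """
    g = 0
    for edge, w in weights.items():
        if v in edge:
            (u,) = edge - {v}
            # Edges crossing the cut count positively, internal edges negatively.
            g += w if part[u] != part[v] else -w
    return g
```

A vertex with a positive gain has more edge weight leading out of its part than into it, so moving it reduces the edge-cut by exactly that amount.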



5 Performance Analysis 

We used the DBLP [15] data set as our experimental data. The DBLP data set
collects about 140,000 entries for published literature in the database research
area. The original DBLP database stores each entry in a separate XML file and
organises the files into multiple directories according to their origin. We
parsed the files into entities, which are represented by the vertices in the
graph, and tags, which are the labels in the graph. The hierarchy of the
directories is also reflected in the graph representation. We specifically
checked the cite entity in each document and linked it to the corresponding
vertices in the graph. The final graph used for the partitioning tests contains
1,693,444 vertices and 1,802,158 arcs. We used the query set in [16] to test our
algorithms, and the query
frequency was also specified.

Table 1. Description of query sets and relative query frequencies for the experiment

Query   Description                             Result  Frequency
                                                Number   Case1   Case2   Case3   Case4
SQ_1    Select the authors for a given title         8     30%     20%     20%     10%
SQ_2    Select all papers authored by              169     20%     10%      5%     10%
        Michael Stonebraker
SQ_3    Select all papers authored by              242     20%     10%     10%     15%
        Michael Stonebraker or Jim Gray
SQ_4    Select all papers published between     47,527      5%      5%     10%     10%
        1990 and 1994
IQ-1    Select all papers by Jim Gray that           2     10%     30%     20%     10%
        are quoted by Michael Stonebraker
IQ-2    Select all papers that quoted              513     10%     20%     25%     30%
        Michael Stonebraker’s papers and
        were published between 1990 and
        1994
IQ-3    Select all pairs of papers that cite   108,717      5%      5%     10%     15%
        one another


Fig. 4. Communication costs of different numbers of processors

Fig. 5. Workload skews of different numbers of processors.



For convenience, we refer to our XML graph partitioning algorithm as XGP. As
the objectives of the XGP algorithm are to reduce the communication cost and to
lower the workload skew, these two measures were used to assess the quality of
the algorithm. Figure 4 compares the communication costs when the round-robin
algorithm and the XGP algorithm are used for data partitioning. The communication
cost is measured by the number of remotely requested pages. We can see that the
XGP algorithm incurs a lower communication cost than the round-robin algorithm.
Figure 5 shows workload skews among the processing nodes. The workload skew is
measured by the difference between the workload of each partition and the average
workload. It is defined in formulas (2) and (3).



$$ \mathit{Avg\_workload} = \frac{\sum_{e \in E} w(e)}{m} \qquad (2) $$

$$ \mathit{workload\_skew} = \frac{\sum_{i=1}^{m} \left| \mathit{workload}(P_i) - \mathit{Avg\_workload} \right|}{m} \qquad (3) $$



It can be seen that the workload skew for the XGP algorithm does not change much
as the number of processing nodes increases. XGP also produces a workload skew
that is much lower than that of the round-robin algorithm.
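
The skew measure is the mean absolute deviation of the partition workloads from the average. A minimal Python sketch follows; the name `workload_skew` is hypothetical, and the average is assumed here to be the mean of the per-partition workloads passed in.

```python
def workload_skew(partition_workloads):
    """Mean absolute deviation of per-partition workloads from their average."""
    m = len(partition_workloads)
    avg = sum(partition_workloads) / m  # assumed form of the average workload
    return sum(abs(w - avg) for w in partition_workloads) / m
```

A perfectly balanced placement gives a skew of zero, and the value grows with the total distance of the partitions from the average.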




Fig. 6. Communication costs of different numbers of processors.

Fig. 7. Workload skews of different numbers of processors

The last experiment tested the impact of query distributions on the workload
balance. We measured the communication costs and workload skews produced by
partitioning under different query frequency distributions. Figure 6 and Figure 7
show that the XGP algorithm performs well under all four cases shown in Table 1:
the communication costs and workload skews of the four cases are quite close.
Because the weight used in the XGP algorithm depends on the query frequency,
the partitions for different query distributions change accordingly.



6 Conclusion 

In this study, we have developed a data placement strategy for XML documents
on parallel processing systems. The approach is based on the multilevel graph
partitioning algorithm and takes into account the particular features of XML
data and XML queries. A new algorithm is proposed for deriving the weighted graph
from the labelled directed graph by using the schema information implied by XML
queries. Under our approach, entities that a query accesses by navigation are
assigned to the same processing node, and instances accessed by the same query
are distributed evenly across all the processing nodes. In the coarsening phase
of the multilevel graph partitioning algorithm, all vertices in the neighbourhood
of the selected matching vertex are coalesced based on their edge weight. This
criterion speeds up coarsening and reduces the chance that vertices accessed by
navigation are assigned to different processing nodes. In the partitioning
phase, the weights of multivertices are used to distribute the workload of a
query evenly. The performance analysis shows that the partition produced by our
algorithm reduces the communication cost and lowers the workload skew compared
with round-robin partitioning. In our future work, we will focus on the parallel
processing of XML queries and on XML query optimisation that takes different
data placement strategies into account.



Abstract. Despite many years of research into mechanisms for checking
integrity in databases, current commercial systems provide support for 
checking only the simplest forms of integrity constraints. However, both 
the users and developers of database applications are increasingly aware 
of the problems that can arise when poor quality data is allowed to enter, 
and remain within, large scale databases. Clearly, some form of integrity 
checking is required for such applications, but it must be packaged in 
a way that respects the business context in which the application is 
to operate. In particular, this means acknowledging that mission critical 
transaction processing cannot be delayed while integrity checks are made. 
In this paper, we propose an approach which makes use of periods of low 
database activity for integrity checking. We present a range of algorithms 
for scheduling integrity checks during these periods, and describe the 
results of our initial experiments with the system. 



1 Introduction 

Over recent years, both users and developers of information systems (IS) have
become increasingly aware of the problems that can arise when poor quality data
is allowed to enter, and remain within, large scale databases. Studies place
the proportion of bad data maintained by organisations at anything between 10%
and 70%. The situation can be worse for some individual systems. For
example, a recent BBC Radio 4 programme reported that when the contents of
one U.K. metropolitan police database were compared with paper-based records,
100% of the records were found to contain some form of discrepancy [2]. There
is also an accumulating body of anecdotal evidence highlighting the increasing
effect of data quality problems on the lives of individuals.

These errors, many of which have hitherto lain “dormant”, are now coming
to light as companies attempt to use their data in new and unanticipated ways.
Perhaps the most prominent example of this phenomenon is the data warehouse.
The integration of data from different database systems, produced by different
applications and managed by different parts of the organisation, into a single
coherent framework typically reveals a surprising number of inconsistencies and
errors within the original data sources [3]. Cleaning up the data prior to loading
it into the warehouse requires a great deal of data processing effort (running
queries to detect potential errors in the data) and human effort (correcting the
errors found, in consultation with domain experts). While one-off improvement
efforts of this kind can result in dramatic reductions in error rates in databases,
they are expensive and time-consuming. Moreover, since few organisations can
afford to repeat the exercise regularly, they do nothing to maintain data quality
in the long term. Once data cleansing is complete, the existing IS processes will
continue to allow bad data to enter the system.

An alternative approach is to monitor data quality continuously, and to insist 
that all errors are corrected before data can be entered into the system. This is 
exactly what an integrity constraint mechanism promises to do. The database
owner describes the conditions that should hold for all reasonable and consistent
data sets. The DBMS and application programs then check all updates
executed against the database, and refuse to allow any that would violate those 
conditions. However, despite nearly thirty years of research into mechanisms for 
checking data integrity, application developers are still reluctant to implement 
any but the simplest and cheapest integrity constraints. The problem is that a 
generalised integrity checking mechanism is very expensive, and most developers 
have taken the view that they cannot afford to include even hand-coded integrity 
checks in their applications. They would prefer to live with the possibility of bad 
data entering the system rather than hold up the processing of mission critical 
transactions, something that might lead to a direct loss in revenue for the
organisation in question.

In fact, the problem with integrity checking is not so much that it is slow, but 
that it is slow at the most inconvenient of times. Because of the insistence that 
bad data must be prevented from entering the system at any cost, all integrity 
checking activities occur at update-time or on transaction commit. However, 
most updates are made during periods of high business activity, when the or- 
ganisation’s revenue is critically dependent upon the rate at which transactions 
can be processed. During periods of normal business activity, the application 
cannot afford to spend time processing complicated integrity checking queries, 
or for business transactions to be blocked until the integrity checker releases its 
shared read locks. In contrast, during the night, when the transaction processing 
rate is much lower, the database system will often sit idle for several hours at a 
time. 

In this paper, we propose a “middle way” for data quality assessment, be- 
tween the two extremes of one-off data cleansing activities and continuous in- 
tegrity checking. Under this approach, constraint checking is delayed until peri- 
ods when the load on the system is expected to be low, thus achieving a regular 
pattern of checking but without incurring an unacceptable cut in the rate at 
which the system can process important business transactions. We present the 
basic architecture of our system (called LOIS - the “Lights Out Integrity Sub- 
system”), a selection of the algorithms used to coordinate periodic constraint 
checking and the results of our initial experiments with the system. We begin, in
Section 2, by surveying the range of methods for checking integrity. In Section 3
we describe the overall architecture of the LOIS system, while Section 4 presents
the algorithms for scheduling constraint checking. Our experimental framework
and results are then given, and a final section concludes.



2 A Comparison of Approaches to Integrity Checking

Traditional integrity checking (which we call immediate integrity checking) and 
data cleansing can be seen as two ends of a spectrum of possible approaches to 
data integrity. At one end, we check relevant constraints on every state change, 
no matter how small and insignificant; at the other end, constraints are not 
checked until some major new use for the data is identified that is hampered by 
the poor state of the data. Further, at one end of the spectrum, violations are 
never allowed to enter the database whereas, at the other, violations may exist 
in the database both before and after constraint checking. Unsurprisingly, each 
approach has advantages and disadvantages when applied in practice. 

2.1 Immediate Constraint Checking 

In immediate integrity checking, constraints are typically defined using a high- 
level declarative language. Each constraint is compiled into fragments of proce- 
dural code which check the constraint, and directives to the DBMS to execute 
these fragments when events occur which might violate it. One very common
way of achieving this is to use the declarative constraint to generate a collection
of active rules that will fire when potentially violating updates take place.
Researchers working on techniques for immediate integrity checking have recognised
the difficulty of producing efficient translations of integrity constraints,
and have directed much effort at the problem of avoiding unnecessary integrity
checking whenever possible.

We have already mentioned the major problems with immediate checking; 
namely, that it can be very time-consuming, and occurs at times when processing 
capacity is at a premium. Clearly, however, some constraints are so important 
that it is worth the processing time required to enforce them continuously. The 
basic structural constraints typically associated with relational databases, such 
as key integrity and referential integrity are good examples of this kind of con- 
straint, as are “hard” constraints that ensure that an organisation’s activities 
comply with the law, or that patients in a hospital are not prescribed lethal 
doses of drugs, for example. For many constraints, however, immediate checking 
is overkill, requiring that considerable processing effort be expended to check 
validity of data that may not be used again for some weeks or months. 

A further factor that reduces the applicability of immediate integrity checking 
in practice is that it is too unforgiving for use in real applications. Inconsistent 
data is never allowed to enter the system, no matter what the context or conse- 
quences. All too often, however, such inconsistencies can reflect a failure in the 
design of the system rather than in the data itself. If they are to be usable in the
real world, database systems must be capable of storing exceptional data that
violates one or more of the system constraints.

2.2 Data Cleansing 

Data cleansing is the process of identifying and removing errors and
inconsistencies in data sets, in order to prepare them for some new task; for example,
feeding the data into a data warehouse or using it for some form of analysis that
is sensitive to errors in data (e.g. data mining). Detection of errors in the input 
data set(s) typically involves either: 

— guessing at the types of error that might occur, and writing customised 
queries to detect any instances of these types, 

— using a data profiling tool, such as EvokeSoft’s Axio, or

— using sampling to produce a subset of values for a human expert to examine
manually.

Once errors are identified, they must be corrected. Unlike in immediate checking, 
where we expect to correct each error as it occurs, in data cleansing it is common 
to find the same error in many hundreds or thousands of records. Because of the 
large numbers of errors, it is necessary to apply some sort of transformation to
the data to correct them “in bulk”.

The principal advantage of data cleansing is that it typically has only a minor 
impact on standard business processing. Rather than work with the original data 
source, it is much more common for data cleansing staff to extract relevant sets 
of data (e.g. as flat files), which are then examined and corrected entirely off- 
line. Extracts from mission critical systems can be performed overnight, so that 
day-to-day processing is completely unaffected. 

The principal disadvantage of data cleansing as a means of maintaining data
quality over the long term is that it is not performed frequently enough (i.e.
it consumes too few data processing resources). Effectively, we allow all errors
to enter the system unhindered, and then apply panic measures to remove the 
most expensive when their presence threatens to cost more than their detection 
and removal. In addition, data cleansing attacks the symptoms rather than the 
cause of the data quality problems. Once the cleansing effort is complete, errors 
can continue to enter the database system as frequently as they could before it. 
Moreover, due to pressure on resources, it is rare for corrections to be propagated 
back to the original source data sets. A further disadvantage of this approach is 
that the errors that are found may have been introduced many years before they 
are detected. This makes it very difficult to determine the correct repair strategy 
for the errors, as their original context may be lost completely. In contrast, the 
context of entry of errors is often (though not always) available to immediate 
checking techniques. 



2.3 Less “Extreme” Forms of Integrity Checking 

Despite the problems described above for the two ends of the integrity checking 
spectrum, very little attention has been paid to approaches which attempt to 
find a compromise between them. If immediate checking requires too frequent 
checking, and data cleansing too infrequent checking, what is the “optimum” 
frequency for checking constraints? The simplest answer to this question is to
say that the user must decide. For example, both Cremers and Domann
and Cammarata et al. have proposed integrity checking functions that can
be invoked by the user on demand. As this would seem to place too
much responsibility on the user, these authors have independently proposed that
constraints could be checked periodically, at times set in advance by the user (e.g.
at lunchtime or overnight). However, this is no guarantee that integrity checking
will not have a detrimental effect on normal business processing, as both sets of
authors assume that the entire database will be checked for all constraints whenever
the integrity function is invoked. This may be realistic for small databases, which
can be checked in a short time, but could interfere with mission critical
transactions in large-scale databases.

Rather than relying on the user to request regular integrity checks, it would 
be better if the DBMS could decide whether an immediate check is necessary or 
whether a check could safely be made at some later time. Lafue has proposed
a method for achieving this in a limited form, based on semantic integrity
dependencies. This proposal is founded on the observation that while there
are theoretically many repairs for a given constraint, in practice the number 
of repairs is often much more limited and, in some cases, focusses on only one 
or two of the variables appearing in the constraint. For example, violations of 
a constraint which relates the modules chosen by an undergraduate student to 
the number of credits available from those modules would always be repaired by 
changing the module choices of the student, and not by changing the number of 
credits available from the module. 

In Lafue’s system, the user identifies the variables which can be changed 
in order to repair a constraint violation (called dependent variables) and those 
which cannot (independent variables). If a change occurs to some object which 
appears as a dependent variable in a constraint, then that constraint must be 
checked immediately. However, if an independent variable object changes, there 
is no need to check the constraint, as it cannot be that object which is at fault. 
Instead, the check can be deferred until the next time someone accesses the 
dependent variable objects associated with that constraint. Essentially, by spec- 
ifying the dependent and independent variables for each constraint, the user is 
telling the DBMS when it is “safe” to delay an integrity check. However, this 
method only provides a way of cutting down the amount of integrity checking 
that is required. Business processing will still be interrupted and delayed by the 
remaining integrity checks. An alternative approach is to adapt the idea of ask- 
ing the user to specify the times when constraint checking will be allowed, but to 
leave the DBMS with the task of deciding which constraints should be checked 
at each of these times. This is the approach that we have explored, and which 
we describe in the remainder of this paper. 



3 The “Lights Out” Integrity Subsystem 

As we have said, traditional approaches to integrity checking place unacceptable
demands on the system resources during peak business hours. The “Lights
Out” Integrity Subsystem (LOIS) attempts to overcome this problem by making
use of “off-peak” time to process the integrity constraints in a dynamic manner.
The periods which constitute this “off-peak business time” are specified in
advance (by the database administrator) in the form of a timetable showing the
start and end points of periods of expected low activity. An example of such
a timetable is shown in Table 1, where hours when the load on the system is
expected to be high are marked with an ‘H’, and hours when the system is
expected to be sitting idle are marked with an ‘L’. Effectively, the timetable states
when the system administrator is prepared to release processing resources for
integrity checking and when she/he is not. Within each low-activity time slot,
the DBMS decides dynamically which constraints to check. In general, there can
be no guarantee that all of the integrity constraints can be checked completely
within each time slot. Therefore, constraint checking must be scheduled across
several time slots (i.e. periods of low system activity), ideally in a way that
ensures that all serious constraint violations are detected before the bad data is
used by some major business process.

Table 1. Example timetable for controlling LOIS

The architecture of LOIS (Figure 1) comprises two main components:

— the System Utilisation Monitor (SUM): this component waits for periods of 
low activity, as defined by the current system activity timetable, starts
constraint checking when such a period begins, and stops it when the period ends.

— the Constraint Selector: when awakened by the SUM, this component se- 
lects a constraint to check (according to some predefined criteria), extracts 
the query which implements the constraint from the constraint metadata, 
executes it and records details of any violations found. This process then 
repeats until the Constraint Selector is closed down by the SUM. 

One of the aims of our work is to determine the most effective set of criteria 
that can be used for selecting the constraints to be checked in any given time 
slot. We have therefore constructed the Constraint Selector component in such 
a way that different selection algorithms can easily be plugged in to the system. 
This has allowed us to experiment with a variety of different selection criteria, 
as described later in Section 4 of this paper.

Fig. 1. Overview of the LOIS Architecture

The LOIS components are supported by three sets of metadata:

— the Constraint Metadata: stores details of the constraints that are to be
checked against the database. For each constraint, a unique identifier, an
SQL query that can be used to detect violations of the constraint, and an
estimate of the time required to execute the query are stored:

constraint(ConstraintID, SQLQueryString, Estimate)

— the Constraint Violation Log: violations found during the constraint checking
process are stored in a constraint violation log table, for subsequent browsing
or analysis by the database administrator. The violation log table has the
following structure:

violation(ViolationId, ConstraintId, TableName, TableNameRowId, TimeDetected)

— the Timetable Metadata: the start and end times of the expected periods of 
low activity are stored within the following table: 

timetable(PeriodId, StartTime, EndTime, TimetableLength)

A timetable may be of arbitrary length (a week or a month are typical 
lengths) and may contain an arbitrary number of low and high periods of 
activity. The timetable metadata records the start and end times of each 
period of low activity, as a relative time (in minutes) from the start of the 
timetable cycle. The timetable length (in minutes) is also stored with each 
period for convenience. 
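
To illustrate how relative times in the timetable metadata might be used, the following Python sketch maps a wall-clock time onto the timetable cycle and looks up the low-activity period that contains it. The function names, the tuple layout of `periods` and the idea of a fixed `cycle_start` are assumptions for the example; they are not taken from the LOIS implementation.

```python
from datetime import datetime


def minutes_into_cycle(now, cycle_start, timetable_length):
    """Minutes elapsed since the start of the current timetable cycle."""
    elapsed = int((now - cycle_start).total_seconds() // 60)
    return elapsed % timetable_length


def current_low_period(periods, offset):
    """Find the low-activity period containing `offset`.

    periods is a list of (period_id, start_time, end_time) in minutes from
    the start of the cycle. Returns (period_id, minutes_remaining) or None.
    """
    for period_id, start, end in periods:
        if start <= offset < end:
            return period_id, end - offset
    return None
```

Because the offset is taken modulo the timetable length, the same table of periods serves every repetition of the cycle.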

Additional metadata is required by some of the selection algorithms that we
have experimented with. This is described in the following section.



4 Constraint Selection Criteria

The heart of LOIS is the constraint selector component, which determines which 
constraints will be checked in any given period of low system activity. Since, in 
general, it will not be possible to check all constraints in each time slot, LOIS 
must use some selection criteria in order to identify the subset of constraints that 
can most usefully be checked in any one period. Clearly, one obvious element 
of these criteria is the amount of time required to check a constraint: there 
is no point in wasting time trying to check a constraint that requires 3 hours 
of processing in a time slot that lasts for 1 hour. Another element might be 
the degree of importance attached to each constraint by the user, so that more 
important constraints are checked more quickly (or more frequently) than less 
important ones. 

The basic algorithm used by the LOIS constraint checker is as follows: 

load selected constraint metadata into in-memory structures
loop
    determine amount of time left in low activity period
    select a constraint according to the current selection criteria
    if no such constraint can be found then
        exit
    else
        evaluate the constraint query
        log any violations found
        update constraint metadata if appropriate
    end if
end loop

This simple algorithm is invoked every time the constraint checker receives a 
signal indicating that a period of low activity has begun, and continues to ex- 
ecute until interrupted by a signal that the period has ended or until no more 
constraints can be found to check. 
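
The control loop above can be sketched in Python with the selection criteria passed in as a function, which mirrors the pluggable design of the Constraint Selector. All names here (`run_checker`, `select`, `run_query`, `time_left`) and the dictionary layout are hypothetical, and interruption by an end-of-period signal is not modelled.

```python
def run_checker(constraints, select, run_query, time_left, log):
    """Check constraints until the selector finds nothing that fits.

    constraints maps id -> dict with at least 'query' and 'checked'
    select(constraints, remaining) returns a constraint id or None
    run_query(sql) returns an iterable of violating rows
    time_left() returns the minutes left in the low-activity period
    """
    while True:
        remaining = time_left()
        chosen = select(constraints, remaining)
        if chosen is None:
            break  # no constraint can be checked in the time remaining
        for row in run_query(constraints[chosen]["query"]):
            log.append((chosen, row))  # record each violation found
        constraints[chosen]["checked"] = True
```

Keeping the selector as a parameter means that each of the selection algorithms described below can be swapped in without changing the loop.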

In order to determine the most useful selection criteria in practice, we have 
experimented with three different selection algorithms, based on different char- 
acteristics of the integrity constraints to be checked: 

— an estimate of how long each constraint will take to check, 

— a user-defined partial ordering on constraints, describing their relative im- 
portance, and 

— user-supplied information about the maximum amount of time that may 
elapse between checks of each constraint. 

We now describe how these characteristics are exploited by each of our three
constraint selection algorithms.



4.1 Trivial Constraint Selection

The principle underlying the simplest of the three selection algorithms is that we
should choose the largest constraint¹ that can be evaluated in the time remaining
within the current time slot. In order to support this, we store an estimate of 
how long each constraint query takes to evaluate in the constraint metadata. The 
estimate is automatically maintained by LOIS, which keeps track of how long 
each constraint query takes to run in practice, and computes a sliding average 
of the time required over each execution. 
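
One way such a run-time estimate could be maintained is sketched below in Python. The text does not say how the sliding average is computed, so the fixed window of recent executions, the default window size and the class name `RuntimeEstimate` are all assumptions.

```python
from collections import deque


class RuntimeEstimate:
    """Sliding average over the most recent query execution times."""

    def __init__(self, window=5):
        # Assumed window size; older samples fall out automatically.
        self.samples = deque(maxlen=window)

    def record(self, minutes):
        self.samples.append(minutes)

    @property
    def value(self):
        """Current estimate, or None before the first execution."""
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)
```

A bounded window lets the estimate follow changes in table size over time instead of being dominated by old measurements.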

However, an algorithm which simply chose the largest available constraint 
at each point would repeatedly select the same set of constraints to check over 
and over again, when we would prefer LOIS to systematically work through the 
whole set of constraints. We therefore keep track of which constraints have been 
checked by LOIS, and modify the selection criterion to be: 

Choose the largest constraint that can be checked in the time remaining 
that has not been checked in the current round. 

Once all constraints have been checked, we mark all constraints as “unchecked” 
and the cycle of checking can begin again. 

This selection criterion will eventually check all constraints, but it has the
disadvantage that some time available for constraint checking may be wasted
unnecessarily. For example, suppose that 15 minutes of checking time remain, but 
that no unchecked constraint can be found that can be executed in this amount of 
time. In this case, it would be more productive to recheck one of the constraints 
that can be checked in this time. This is far from pointless, as the constraint 
in question may have been checked in some previous period of low activity, and 
violations may have been created in the database in the meantime. We therefore 
extend our earlier selection criteria to take this possibility for rechecking into 
account: 

Choose the largest constraint that can be checked in the time remaining 
that has not been checked in the current round. If no such constraint 
exists, then choose the largest constraint that can be checked in the time 
remaining. 

The algorithm which schedules constraints according to these criteria is given in
the appendix.
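
The extended criterion can be paraphrased in a few lines of Python. This is a sketch of the criterion as stated in the text, not a transcription of the appendix algorithm; `select_trivial` and the dictionary keys are hypothetical, and the caller is assumed to set `checked` after running the chosen constraint.

```python
def select_trivial(constraints, remaining):
    """Pick the largest constraint that fits in the time remaining.

    constraints is a list of dicts with 'id', 'estimate' and 'checked'.
    Unchecked constraints are preferred; checked ones are used as a fallback.
    """
    feasible = [c for c in constraints if c["estimate"] <= remaining]
    if not feasible:
        return None
    if all(c["checked"] for c in constraints):
        for c in constraints:
            c["checked"] = False  # every constraint done: start a new round
    unchecked = [c for c in feasible if not c["checked"]]
    pool = unchecked or feasible  # recheck rather than waste the time slot
    return max(pool, key=lambda c: c["estimate"])
```

The `unchecked or feasible` fallback is what lets a short remaining slot be used to recheck a small constraint instead of sitting idle.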



¹ By “largest” here, we mean the constraint which will take the longest to check.



4.2 Prioritised Constraint Selection

The philosophy underpinning the previous algorithm is that “all constraints
are equal”, and that therefore equal weight should be given to the checking
of each. In reality, however, some violations are potentially more serious than
others, and therefore some constraints should be given priority over others. Our
second algorithm recognises this fact by making use of a user-defined partial
ordering over constraints (stored in the constraint metadata) to ensure that the
more important constraints are checked more frequently. With this additional
information, our selection criteria become:

Choose the highest priority constraint that can be checked in the time 
remaining that has not been checked in the current round. If more than one 
such constraint exists, then choose the “largest” such constraint. If no 
such constraint exists, then choose the largest, highest priority constraint 
that can be checked in the time available. 

We use the same algorithm for prioritised selection as for trivial selection, except 
that the constraints are ordered first by their priority, and then (when several 
constraints have the same priority) by their size (i.e. query estimate). 
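The two-level ordering can be sketched in Python as below. This is a simplified illustration rather than the LOIS code: the names are hypothetical, and the user-defined partial ordering is assumed to have been flattened into an integer rank in which 1 is the highest priority.

```python
from typing import NamedTuple, Optional, Sequence


class Constraint(NamedTuple):
    name: str
    priority: int     # assumption: 1 is the highest priority
    estimate: float   # estimated checking time, in seconds
    checked: bool     # already checked in the current round?


def select_prioritised(constraints: Sequence[Constraint],
                       time_left: float) -> Optional[Constraint]:
    """Pick by priority first, then by size, among constraints that fit."""
    fitting = [c for c in constraints if c.estimate <= time_left]
    if not fitting:
        return None
    # Prefer constraints not yet checked in this round; otherwise recheck.
    unchecked = [c for c in fitting if not c.checked]
    pool = unchecked or fitting
    # Sort key: highest priority first, then longest estimate first.
    return min(pool, key=lambda c: (c.priority, -c.estimate))
```

The only difference from a size-only selection is the sort key, which mirrors the remark that the same algorithm is reused with a different ordering.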



4.3 Cyclic Constraint Selection 

While a priority mechanism such as that described above is helpful, it can only 
affect the order in which constraints are checked. In reality, it may be more 
useful to try to control the frequency at which they are checked. For example, 
a constraint which guards against important violations in frequently updated 
data should be checked on a more regular basis than a constraint that prevents 
a minor error in a more static set of data. For our final algorithm, therefore, we 
allow the user to associate a “cycle time” with each constraint, which indicates 
the maximum amount of time that can be allowed to go by without checking 
the constraint. This is stored in the constraint metadata, along with details of 
when each constraint was last checked by LOIS. 

LOIS then uses this information to select constraints whose cycles are almost 
over in preference to those that have longer to wait before they must be 
checked. As with the other algorithms, only constraints which are small enough 
to be checked within the period are selected, and when there is a choice of 
constraints with the same cycle deadline, the largest possible constraint is selected. 
Of course, we cannot always guarantee that every such deadline will be met, 
particularly if constraint cycle times are too small or the timetable is too tight. The 
constraint checker must therefore behave appropriately when some deadlines are 
missed. We have chosen to give highest priority to constraints which are most 
overdue, so that the algorithm attempts to “catch up” on missed deadlines as 
soon as possible. The risk with this approach is that more and more constraints 
may not be checked by their deadline, as the algorithm tries to catch up on its 
backlog of overdue work. 
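A minimal Python sketch of this deadline-driven selection is given below. All names are hypothetical and times are assumed to be plain numbers of seconds; the sketch only illustrates the selection rule, not the metadata handling performed by LOIS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Constraint:
    name: str
    estimate: float      # estimated checking time, in seconds
    cycle_time: float    # maximum allowed gap between checks, in seconds
    last_checked: float  # time of the last check, in seconds


def time_to_deadline(c: Constraint, now: float) -> float:
    # Negative when the constraint is already overdue.
    return c.last_checked + c.cycle_time - now


def select_cyclic(constraints: List[Constraint], now: float,
                  period_end: float) -> Optional[Constraint]:
    """Return the most urgent constraint that fits, or None."""
    time_left = period_end - now
    fitting = [c for c in constraints if c.estimate <= time_left]
    if not fitting:
        return None
    # Most urgent deadline first; ties broken in favour of the largest.
    return min(fitting, key=lambda c: (time_to_deadline(c, now), -c.estimate))
```

After a constraint has been checked, its `last_checked` value would be set to the current time, which changes its position in the ordering for the next selection.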

The algorithm used for Cyclic Selection is given in Appendix B. For this form 
of scheduling, the constraints are ordered by ascending amount of time remaining 
before each constraint must be checked. For constraints which are overdue, this 
value will be a negative number, and therefore the most overdue constraint will 
appear at the very beginning of the list of constraints to be checked. However, 
unlike the other two algorithms, this form of selection has the added overhead 
of having to update this ordering dynamically throughout each time period, 
whereas the other algorithms assume that the ordering will remain fixed within 
each low activity period. 

Table 2. Example constraints derived from the CAS schema 

If PersonTitle = Mr then PersonGender = M 
TotalStudentsEnrolled = Sum(TotalFemale + TotalMale) 
If RegDisabled = N then DisabilityCode = 00 
If RegDisabled = Y then DisabilityCode != 00 and in Range (01 - 99) 

5 Experimental Framework 

In order to evaluate the performance of the three selection algorithms we have 
constructed an experimental framework, based on a real legacy database system, 
that allows us to compare the patterns of constraint checking achieved by each 
one. The legacy system that our framework simulates is the Course Administration 
System (CAS) used by the Department of Lifelong Learning at Cardiff 
University. CAS holds detailed information on courses, tutors, students and 
mailing lists for marketing purposes, and currently contains something in the 
region of 70,000 records. Typical transaction volumes are low, at about 200 per 
day. The system is implemented in COBOL on top of the TurboIMAGE/3000 
Database Management System from Hewlett Packard. Unfortunately the source 
code to CAS is not available, and as a result it was not possible to determine the 
integrity constraints enforced by the original application. Instead, we analysed 
the schema and samples of the data, and identified some eighty useful integrity 
constraints of varying degrees of complexity. From these, we excluded all struc- 
tural constraints of the kind better implemented using the declarative constraint 
facilities provided by most commercial DBMSs, and simple constraints on indi- 
vidual tuples as these are better checked using immediate checking. This left 
some sixty-two more complex constraints (e.g. those involving multiple tables or 
aggregate functions), which are more appropriate for checking under the LOIS 
approach. Table 2 shows some examples of the types of constraints identified for 
checking with LOIS. 
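To make the idea of a constraint as a stored query concrete, the following Python sketch evaluates two of the constraints from Table 2 against an in-memory SQLite database. The table names, key columns and sample rows are invented for the illustration (only the column names come from Table 2), and SQLite merely stands in for the DBMS actually used by CAS.

```python
import sqlite3

# Hypothetical tables; only the column names are taken from Table 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student (StudentId INTEGER PRIMARY KEY,
                          RegDisabled TEXT, DisabilityCode TEXT);
    CREATE TABLE Course (CourseId INTEGER PRIMARY KEY,
                         TotalStudentsEnrolled INTEGER,
                         TotalFemale INTEGER, TotalMale INTEGER);
    INSERT INTO Student VALUES (1, 'N', '00'), (2, 'Y', '14'), (3, 'N', '07');
    INSERT INTO Course VALUES (10, 30, 18, 12), (11, 25, 10, 14);
""")

# Each constraint is expressed as a query that returns the violating rows.
VIOLATION_QUERIES = {
    "not_registered_disabled_implies_code_00":
        "SELECT StudentId FROM Student "
        "WHERE RegDisabled = 'N' AND DisabilityCode <> '00'",
    "enrolment_total_matches_breakdown":
        "SELECT CourseId FROM Course "
        "WHERE TotalStudentsEnrolled <> TotalFemale + TotalMale",
}


def find_violations(connection, query):
    """Evaluate one constraint query and return the offending keys."""
    return [row[0] for row in connection.execute(query)]


violations = {name: find_violations(conn, query)
              for name, query in VIOLATION_QUERIES.items()}
```

Writing each constraint as a query for its violations means that an empty result signals that the constraint currently holds, and a non-empty result lists the rows to be logged.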

To further develop the simulation of the legacy system, we analysed a number 
of key business processes within the Department of Lifelong Learning in order 
to determine typical data set sizes and transaction patterns. Using this informa- 
tion, we constructed a program to generate a sequence of random transactions 
representative of six months’ worth of data updates within CAS, scaled down in 
volume so that they can be executed in just six hours within the LOIS system. 
We also constructed a scaled-down timetable to simulate the periods of low and 
high activity in the CAS working week (see Figure 0). 

We then executed the simulated transaction sequence three times (starting 
from the same starting state each time), with LOIS configured to use a different 
selection algorithm for each run. The graphs in Figure 0 show the number of 
times each constraint was checked using trivial selection (top graph), prioritised 
selection (middle graph) and cyclic selection (bottom graph). Each bar on each 
graph represents a constraint, and in each case constraints are ordered by size, 
with the most time-consuming constraint appearing at the left and the least at 
the right. 

Using trivial selection, LOIS managed to check almost all the constraints at 
least once, with just four constraints never being checked. These are all quite 
large constraints which do not fit easily into any of the available time slots. In 
addition, the algorithm seemed to “get stuck” on three constraints, which it 
checked a very large number of times. The performance of the prioritised selec- 
tion algorithm is much better. Again, four constraints are never checked, but the 
vast majority of constraints are checked at least five times. The algorithm only 
got “stuck” on one particular constraint. Interestingly, cyclic checking was least 
successful of the three. It clearly checks some of the constraints more frequently 
than the other two algorithms, but also ignored twenty-nine of the constraints 
completely. This is partly due to the additional overhead involved in this third se- 
lection algorithm, which means that less time is available for constraint checking, 
but it may also reflect the difficulties of specifying a set of cycle time deadlines 
which can be achieved within a given timetable. 

These results raise the question as to why prioritised selection seems to spend 
more time in checking constraints than trivial selection, when the two algorithms 
are so similar. In order to understand this, it is necessary to consider the amount 
of time “wasted” by each algorithm when it is interrupted in the middle of 
checking a constraint, at the end of the period of low activity. This is shown by 
the set of graphs in FigureQ, which gives the number of times each constraint was 
aborted during checking by each of the three algorithms. We can see from these 
graphs that trivial selection results in far more time being wasted by aborted 
constraint checks, when compared to prioritised selection. Cyclic selection wastes 
still less time, but this again would seem to be due to the increased overhead 
of this algorithm. These initial results would seem to indicate that the use 
of priorities is the most effective means to select constraints for “lights out” 
integrity checking. 

^ This diagnosis can be further supported by our results. Since we know how long each 
constraint takes to check, we can use these results to estimate how long each 
algorithm spent doing “useful” constraint checking activity. Prioritised selection spends 
3288s, trivial selection spends 2740s while cyclic selection spends just 2660s in 
successful constraint checking. 

6 Conclusions 

We have described the design and evaluation of a new approach to integrity 
checking. Unlike traditional integrity checking mechanisms, which do all checking 
at the time the update is made, our approach makes use of the times when normal 
database activity is low to undertake constraint checking and therefore does not 
interfere with the processing of critical business transactions. This has a number 
of additional advantages: 

— The system allows inconsistent data to be entered into the database. While 
this might appear to be a disadvantage, experience from real database 
applications indicates that the stringent requirements of full data integrity 
imposed by traditional approaches are unworkable in practice. 

— The responsibility for repairing integrity violations is moved from the end- 
user/data entry point (where the information required to make appropriate 
corrections is often not available) to the data quality or database adminis- 
tration team. 

— LOIS allows “low priority” constraints to be entered into the system to 
check for rare errors, or to report occurrence of data patterns that might be 
indicative of an anomalous situation, without impacting on business-critical 
processing rates. 

A major disadvantage of the LOIS approach is that certain important kinds of 
error may not be discovered until after the time at which they can be corrected. 
For example, a telephone sales operator may have been able to correct an error 
through dialogue with the customer, if warned at the time of data entry. This 
is exactly the sort of situation that traditional approaches to integrity checking 
handle well. However, it is important to recognise that the LOIS approach is 
complementary to, and not a replacement for, traditional methods of integrity 
control. The developer of the database system can choose to implement some 
constraints using immediate checking (e.g. checks on allowed value ranges, checks 
that some data values are unique or non-null) and others using LOIS (e.g. com- 
plex constraints involving aggregates, or navigation across several tables) — thus 
gaining the advantages of both approaches. 

Having established the basic feasibility of a system based on “lights out” 
integrity checking, we are now investigating ways in which constraint selection 
can be made more appropriate for use in practice. For example, more efficient 
constraint selection could be performed if LOIS made use of the timetable infor- 
mation to pre-plan optimal (or near optimal) schedules for constraint checking 
off-line. Or, one could make use of the transaction log to focus constraint checking 
on areas of the database which have been updated recently. In addition, we are 
investigating the alternative approach of using a real-time monitor to switch on 
integrity checking when the load on the DBMS falls below some pre-determined 
threshold value. This will enable constraint checking to occur in actual periods of 
low activity, rather than in predicted periods of low activity. A further potential 
application of LOIS is in fraud detection, where the LOIS engine could be used 
to search for data patterns that may indicate fraudulent usage of the system, 
without impacting on day-to-day business processing. 
A Trivial Constraint Selection Algorithm 

NC ← total number of constraints 
N ← NC 

load constraint IDs into array constraints[] in descending order of size 
load estimates into array estimates[] 
load checked table into array checked[] 

loop 
    if N = 0 then 
        % All constraints have been checked. Begin the cycle again 
        N ← total number of constraints 
        delete all tuples in checked table 
        reinitialise checked[] with false values 
    end if 

    % Calculate time left in period 
    TL ← end of period - now 

    % Find the first constraint that has not been checked and 
    % that can be checked within the remaining time 
    I ← 1 
    while I ≤ NC and (checked[I] or estimates[I] > TL) do 
        I ← I + 1 
    end while 

    % If no such constraint exists 
    if I > NC then 
        % Find the next constraint that can be checked in the time left 
        I ← 1 
        while I ≤ NC and estimates[I] > TL do 
            I ← I + 1 
        end while 

        if I > NC then 
            % No constraint will fit in the remaining time 
            exit 
        else 
            % Check the selected constraint 
            retrieve and evaluate query for constraints[I] 
            log details of any violations found 
        end if 
    else 
        retrieve and evaluate query for constraints[I] 
        log details of any violations found 
        add checked(constraints[I]) to constraint metadata 
        checked[I] ← true 
        N ← N - 1 
    end if 
end loop 

B Cyclic Constraint Selection Algorithm 

NC ← total number of constraints 

calculate time to next deadline for each constraint 
    (timeLastChecked + frequency - now) 
load constraint IDs into array constraints[] ordered by time to next deadline 
load estimates into array estimates[] ordered by time to next deadline 

loop 
    % Calculate time left in period 
    TL ← end of period - now 

    % Find the first constraint that can be checked within the time left 
    I ← 1 
    while I ≤ NC and estimates[I] > TL do 
        I ← I + 1 
    end while 

    if I > NC then 
        % No such constraint can be found 
        exit 
    else 
        % Check the selected constraint 
        retrieve and evaluate query for constraints[I] 
        log details of any violations found 
        update timeLastChecked of constraints[I] (= now) 

        % Recreate arrays based on new time to deadline for each constraint 
        calculate time to next deadline for each constraint 
        load constraint IDs into array constraints[] ordered by time to next deadline 
        load estimates into array estimates[] ordered by time to next deadline 
    end if 
end loop 
Abstract. The great variety of CASE tools available on the market implies a 
need for data interchange. One approach to satisfying this need is the export and 
import of models. For this to be vendor independent requires standardized 
common interchange formats, either in the form of meta-models or a common 
transfer format. CASE tools use some type of explicit or implicit design 
transformations to transform different types of models, for example conceptual 
to logical. The transformations are important for interchange since a set of 
models which are consistent in one tool may be inconsistent in another tool that 
does not support the same set of transformations. Subsequent modification in 
the latter tool may lead to irresolvable inconsistencies. In this paper we define a 
common, model independent notation for design transformations to facilitate 
interchange between tools so that the meaning of different transformations can 
remain consistent between different CASE tools. The proposal is made in the 
form of a conservative extension to OCL. A run-time interpreter for the 
extension has been built. 



1 Introduction 

Currently there is great variety in the different CASE tools marketed by a large 
number of vendors. It is common in practice for design teams to use more than one 
CASE tool to support the different phases of design in their proprietary design life- 
cycles. There is thus a need for data interchange between the tools offered by 
different CASE tool vendors [4]. In order to interchange models between different 
CASE tools, standardized common meta-models and common transfer formats have 
been defined. Transfer formats such as CDIF [8] and XMI [11] allow CASE tool 
models to be interchanged using a standardized, vendor-independent 
format. 

However, many CASE tools use some type of explicit or implicit design 
transformations to transform different types of models. For example, a conceptual ER 
design may have been transformed by one tool into a specific logical RM. Such 
transformations are important for interchange since a set of models which is 
consistent in one tool may be inconsistent in another tool that does not support the 
same set of transformations. If the models produced using one tool are modified by a 
tool that does not support the design transformations from that tool, irresolvable 
inconsistencies could potentially be introduced between the tools. 
We maintain that there is a need for a representation of design transformations 
which allows such transformations to be shared and interchanged between different 
tools. Such transformations are, however, rarely made explicit. While the metadata in 
a tool can be interchanged using standardized interchange formats, to our knowledge, 
no standard notation for language and platform independent definition and subsequent 
interchange of transformations exists. 

The goal of the work reported here is to define a common, model independent 
notation for design transformation interchange between tools, so that the meaning of 
different transformations will remain consistent between different CASE tools. 

The result of this is a proposed extension of the MOF UML-based meta-model [13] 
which will allow the interchange of transformations as part of a model, together with 
an extension of the OCL language that can be used with the meta-model to describe 
transformations. A prototype system has been implemented, which supports the 
representation and execution of transformations. Extensions to the OCL interpreter to 
support transformations amount to less than 10% of the complete interpreter code. 



2 Previous Work 

In recent years, work in the development of tools for database applications has shown 
that various types of transformations can be used to perform many of the common 
modelling tasks in database design. A survey of various approaches to database 
design that utilise transformation can be found in [5]. Support for such design 
transformations is now expected of tools in the area. However, little progress has been 
made in the area of general interchange of design transformations between CASE 
tools. 

One important characteristic of the database design processes that utilise design 
transformations as presented in textbooks [2,3] is that they are often defined in the 
form of a set of steps which together form a transformation procedure. Some of these 
steps specify alternative transformations which can be selected by the user of the 
procedure. 

Very few approaches have tried to represent transformations and transformation 
procedures directly. The DB-main meta-CASE environment [6] offers support for 
alternative design transformations and transformation procedures. The main flaw in 
the approach from the perspective of this paper is the use of a proprietary procedural 
programming language called Voyager [6] for the implementation of the various 
transformations. Interchange of the transformations and the transformation procedures 
would thus be a very difficult task, since the constructs expressed in the Voyager 
Language would have to be translated to the specific language of the target CASE 
tool. 

An alternative approach to design transformations [10] utilises logics and graph- 
oriented constructs to define transformations, making the transformations declarative 
and thus less dependent on specific programming languages. The approach utilises a 
graph-oriented meta-model with a logic programming implementation. However, 
their graphical model is proprietary and, as presented in the literature, does not 
directly support transformation procedures and the use of alternative transformations. 

Our own concern has been to develop a declarative transformation language for 
supporting Object-Oriented information systems design. In this context, one language 
which is an obvious contender for defining transformation constraints is the OCL 
language, developed for defining constraints in UML models. OCL is declarative, 
platform and language independent, and has the advantage of being a part of the UML 
standard [12]. Several implementations of OCL interpreters exist; one is provided 
by IBM [7], which contributed the OCL language for UML 1.1. The largest problem 
with using the OCL language to 
define design transformations is that it is, in its original guise, side effect free. It thus 
requires extension to support design transformations for inter-tool development of 
designs incorporating transformations. Another extension of the OCL language which 
supports side effects does exist [9]. However, that extension does not allow the 
modification of models; rather it provides the language with a shorthand for posting 
events. Since it is not guaranteed that the source and destination CASE tools support 
the same set of events or methods, such an extension does not actually convey the 
semantics of the modification and is thus not sufficient for transformation 
interchange. 



3 A Conservative Extension of the OCL Language 

In line with OCL’s status as a standard language requiring implementation in any 
compliant CASE tool, any suggested changes to OCL must be kept to a minimum. In 
recognition of this, the proposed extensions of the language have been designed to be 
localised to a small set of concepts. Migration of an existing OCL interpreter should 
be straightforward, and require modest effort. 

The proposed extensions form two distinct categories: extensions that are needed 
to update models, and extensions that will allow the specification of rules. 



Extensions to Provide Support for Object Modification 

The extensions in this category together form the set of extensions that are required to 
perform updates to models. 

The first and most important extension is the addition of an assignment operator. 
The assignment operator, denoted by the symbol is used to assign values to 
properties or roles. Since the same operator is used both to assign properties and roles, 
the extension is localised to a very small part of an OCL interpreter. 

The semantics of the assignment of properties is very straightforward. When an 
assignment statement is executed, the relevant repository object is modified to reflect 
the change of state. 

The semantics of the assignment of roles, i.e. the various types of collections, is not 
as simple. When an assignment of a collection is performed, the old content of the 
collection is removed, and the new content is inserted in its place. This means that if a 
new entry is to be added to a collection, that new object is added to the existing 
collection by assigning the collection the value of the old collection but including the 
new value. 
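The replace-on-assignment semantics can be illustrated with a small Python sketch. The `RepositoryObject` class and its method names are hypothetical stand-ins for a tool repository, and the example does not attempt to reproduce the extended OCL syntax itself.

```python
class RepositoryObject:
    """Hypothetical stand-in for a model object held in a tool repository."""

    def __init__(self, **properties):
        self.properties = dict(properties)   # single-valued properties
        self.roles = {}                       # role name -> collection

    def assign_property(self, name, value):
        # Property assignment simply overwrites the stored state.
        self.properties[name] = value

    def assign_role(self, name, new_content):
        # Role assignment discards the old collection and installs the new one.
        self.roles[name] = list(new_content)


entity = RepositoryObject(name="person")
entity.assign_role("attributes", ["id", "surname"])

# Adding one element: assign the old collection including the new value,
# in the spirit of OCL's side-effect-free collection->including(x).
entity.assign_role("attributes", entity.roles["attributes"] + ["birthdate"])
```

Because every role assignment replaces the whole collection, a single operator suffices for both insertion and removal of elements.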

In order to perform any transformation of a schema, it is necessary to be able to 
create new schema objects. An operation has thus been added to the language which 
creates a new object. The create operator can be applied only to aliases which 
originate from rule declaration statements. It is thus illegal to change the object which 
is referred to by the self alias by calling the create method. 

Analogous to object creation, it is also desirable to be able to delete a modelling 
object from a model. In order to maintain schema consistency, the object deletion 
operation should also preserve referential integrity by cascading the delete. When an 
object is deleted, it is thus simultaneously removed from all collections which refer to 
it. The application restrictions of the creation operator do not apply to the deletion 
operator. It is thus possible to delete the object that the self alias refers to, given that 
no modifications of the object occur after it has been deleted. 
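The cascading behaviour of deletion can be sketched as follows. The `Model` class is a hypothetical in-memory stand-in for a repository; it is meant only to show that a deleted object disappears from every collection that referred to it.

```python
class Model:
    """Hypothetical in-memory model: objects plus named collections."""

    def __init__(self):
        self.objects = set()
        self.collections = {}   # collection name -> list of member objects

    def create(self, obj):
        self.objects.add(obj)
        return obj

    def delete(self, obj):
        # Cascade: drop the object from every collection that refers to it,
        # so that no dangling reference survives the deletion.
        for members in self.collections.values():
            while obj in members:
                members.remove(obj)
        self.objects.discard(obj)


model = Model()
person = model.create("Person")
student = model.create("Student")
model.collections["entities"] = [person, student]
model.collections["supertypes"] = [person]
model.delete(person)
```

After the call to `delete`, neither collection contains the deleted object, which is the referential integrity property required above.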

In order to be able to group several modifications together within a single extended 
OCL expression a symbol was introduced to delineate the components of an 
expression that are to be interpreted separately. This separator symbol is denoted by 
In order to comply with the basic design of the OCL language, all successful 
components evaluate to “true”. The symbol is thus only a shorthand for the logical 
AND operator. 



Extensions to Support Rules 

The OCL language supports aliases, which allow the creation of conditions which 
span several levels of operations. The most well known alias is the self alias, which 
refers to the context object. Declarations of such aliases can be made in the rule 
declaration part of a transformation rule. 

If the OCL language is to be used to specify rules that create objects and modify 
the newly created objects, the language must include some means of declaring the 
objects that are to be created. The ordinary alias mechanism can thus be extended in 
scope to allow declaration aliases, maintaining the standard aliases without any 
modification to the OCL interpreter. When a declaration alias is passed to an OCL 
interpreter the only difference from a standard alias is that it can be used to create new 
objects in a model. 



4 Extending the Meta-model to Support Transformations 

A language for expressing transformations is not sufficient on its own for the 
representation and subsequent interchange of models and the transformations which 
make the constraints between the models explicit. To express a transformation in a 
transferable way requires a common meta-model for expressing OCL rules as well as 
other model components. 

The chosen representation must support a representation of the transformation 
procedure, the individual transformation rules which together make up the procedure, 
and the patterns which supply a user with a choice of alternative transformations for a 
specific object. 

Any object in a meta-model, such as the ER entity class, is associated with a single 
model which groups all of the associated classes together. A transformation procedure 
groups a set of transformation rules together and associates the procedure with a set of 
models between which it performs some type of transformation. The instantiated 
classes contained in the model, for instance an EREntity with the name person, may 
also refer to a transformation pattern which will allow the user to select a specific 
transformation for that particular modelling object. This will allow interchange of the 
choices of transformation that a user has made together with the meta-model and the 
transformation procedures. 
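One possible reading of these relationships is sketched below as Python data classes. The class and field names are assumptions made for the illustration and are not taken from the meta-model figure; rules are held simply as strings containing OCL text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TransformationRule:
    name: str
    condition: str   # standard OCL expression
    action: str      # extended OCL expression


@dataclass
class TransformationPattern:
    name: str        # the alternative chosen by the user


@dataclass
class ModelElement:
    name: str
    pattern: Optional[TransformationPattern] = None


@dataclass
class Model:
    name: str
    elements: List[ModelElement] = field(default_factory=list)


@dataclass
class TransformationProcedure:
    name: str
    rules: List[TransformationRule] = field(default_factory=list)
    models: List[Model] = field(default_factory=list)


# A model element records the user's choice of transformation pattern.
person = ModelElement("person", TransformationPattern("foreign_key"))
er_model = Model("ER", [person])
rm_model = Model("RM")
procedure = TransformationProcedure("ER to RM", models=[er_model, rm_model])
```

Since the pattern is attached to the model element, exporting the model also exports the user's choice of transformation for that element.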

The extended meta-model features needed for handling the proposed extension 
appear in figure 1. The transformations are thus represented at the model level. This 
allows a model element to directly refer to the transformation patterns. The 
transformation information would then be interchanged as part of the model; it is thus 
model specific rather than general. The interpretation of the OCL statements, 
however, must take place at the meta-model level, i.e. at a level that is higher than 
that at which they are represented. The only limitation with this is that the 
transformations cannot transform the procedures or the rules, because such self 
modification would potentially introduce anomalies. 




Fig. 1. Transformation meta-model framework. 



5 Example Schemas 

In order to demonstrate the proposed approach we present a concrete example. 



An Example Meta-model 

One type of transformation which is common and well described in textbooks on 
database design concerns the transformation of an enhanced entity-relationship 
diagram into a corresponding set of tables for the relational model. In the 
textbooks this type of transformation requires the designer to follow a specific 
procedure, described as a number of steps. For each of these steps there is often a set 
of alternative transformations which can be used to produce different sets of tables for 
the same entity-relationship diagram. One of the most complicated steps in such an 
algorithm is the transformation of inheritance hierarchies. For each inheritance 
connection there are at least three main categories of transformation possible. One 
option is to represent inheritance by using a foreign key. A second option is to move 
the attributes of the supertype into the subtype. A third option involves moving the 
attributes from the subtypes into the supertype. In the two last options either the 
supertype or the subtype will not have a directly corresponding table, and the 
attributes and relationship associated with that entity need to be redirected to the 
correct corresponding table. The behaviour of this transformation can become even 
more complex, since the supertype relationship is potentially recursive, allowing 
supertypes of other supertypes. In order to show the function of the transformation 
language in a somewhat realistic setting, the example was devised to illustrate this 
complex behaviour. 
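The three alternatives can be compared with a short Python sketch. It is deliberately simplified: it handles a single supertype/subtype pair, ignores recursive hierarchies and the redirection of relationships, and the option names are invented for the illustration.

```python
def map_inheritance(supertype, subtype, option):
    """Map one supertype/subtype pair to relational tables.

    Each entity is a (name, key, attributes) tuple; the result maps table
    names to column lists. Option names are illustrative only.
    """
    sup_name, sup_key, sup_attrs = supertype
    sub_name, _sub_key, sub_attrs = subtype
    if option == "foreign_key":
        # Both entities get a table; the subtype references the supertype key.
        return {sup_name: [sup_key] + sup_attrs,
                sub_name: [sup_key] + sub_attrs}
    if option == "push_down":
        # Supertype attributes move into the subtype; no supertype table.
        return {sub_name: [sup_key] + sup_attrs + sub_attrs}
    if option == "pull_up":
        # Subtype attributes move into the supertype; no subtype table.
        return {sup_name: [sup_key] + sup_attrs + sub_attrs}
    raise ValueError(f"unknown option: {option}")


person = ("Person", "person_id", ["name"])
student = ("Student", "person_id", ["course"])
tables = map_inheritance(person, student, "push_down")
```

In the last two options one of the entities has no table of its own, which is why its attributes and relationships must be redirected to the surviving table.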

The meta-model devised is an exemplar, designed explicitly as a representative 
test-bed of the transformation procedures developed. Its representation of entity 
relationship models and relational tables is necessarily simplified to reduce the 
complexity of the examples; additional interconnections could have been added to 
give an even higher degree of modelling transparency. The provision for alternative 
transformation patterns has been restricted in the example to the case of the subtype 
relationship. Furthermore, the pattern reference has been simplified in the example by 
using a property (see, for example, rule 1 below). 

The upper part of the diagram (Figure 2) contains the generalised representation of 
an Entity Relationship notation, supporting binary relationships and inheritance 
hierarchies. The class named ERDependency represents dependencies between 
entities, defining which entities will be used instead of another entity if it does not 
directly transform to a table on its own. The lower part of the diagram, below the 
dashed line, contains the generalised representation of the relational data model used 
for this example. The Primary key dependency class is used to represent a 
dependency between two tables where the foreign key is also made the primary key in 
the destination table. 



Example Rules 

The transformations are specified using the OCL language to create the condition for 
the transformation and the extended OCL language to specify the action of the 
transformation rule. The main outline of the rules is similar to the transformation 
procedures that are provided in common textbooks in the database area. There are, 
however, some differences. Due to the nature of the procedures as they are presented 
in a textbook (namely, as algorithms) they do not directly correspond to 
transformation rules. 

The most obvious difference is that the subtype/supertype relationships are 
transformed in the beginning of the transformation process as opposed to at the very 
end of a transformation algorithm. This is because the associated transformations will 
effect the creation of tables for the entities, and will thus need to be performed before 
transformations which modify them. One other notable difference is the 
transformation of the attributes as separate objects and not as part of an entity. It was 
decided to have separate transformations for the attributes in order to gain flexibility 
of transformations. Finally, support for inherited relationships would require separate 
transformations. 

Fig. 2. The example meta-model 

A few simplifications of the rules and the meta-models have been made in order to 
make the example easier to read and understand. In this example, the separate class 
for transformation patterns has been reduced to a property. This property is 
furthermore only present in the supertype relationship class. All references to the 
transformation patterns are thus reduced to conditions that work with this property. In 
a more realistic schema, patterns would be represented as classes and references, and 
patterns for all of the classes would be provided. In order to reduce the size of the 
rules and the size and complexity of the example meta-model, the model class, which 
represents the models containing the other modelling concepts, has been removed. In 
a more realistic schema, all other modelling objects would be contained in a model. 



* ModelElement has been omitted for simplicity; it is a superclass of all classes. 

A few relationships between classes, such as the fact that a primary 
key dependency defines a foreign key, have also been removed to reduce the size of 
the model. The absence of these relationships reduces the modeling transparency [1] 
of the models, but the information can still be derived from data that is present in the 
models. 

A transformation algorithm for models of the complexity of the example meta- 
model in this paper requires a large set of rules. The set of transformation rules 
presented in this paper is thus only a subset of those required. The set of rules selected 
for the example contains the rules that are required by the example schema; all rules 
not necessary for the example schema have been left out of the example set. The 
ordering among the rules is, however, the same as it would be in the complete set for 
the schema. 

Rule 1: Performs the transformation of downward inheritance into downward 
dependencies. A rule that is similar to this rule performs the upward inheritance 
transformations. The condition of this rule ensures that the inheritance relationship 
is only transformed if the supertype has no supertypes that have not yet been 
transformed, and that the object that is to be transformed is of the correct type. The 
action performs the creation of a dependency going from the supertype to the subtype 
entity, and ensures that all dependencies of the supertype are inherited by the subtype. 
The action statement also ensures that no table will be generated for the supertype. 

Class: ERSubtypeRel 
Condition: FromEntity.FromSTRel->forall(Pattern="D" 
    implies DefinesDependency->notempty) AND 
    FromEntity.ToSTRel->forall(Pattern="U" implies 
    DefinesDependency->notempty) AND Pattern="D" AND 
    DefinesDependency->isempty 
Declaration: ERDependency E1 
Action: FromEntity.NoTable:=true; E1.create; 
    DefinesDependency:=E1; E1.Type:="D"; 
    E1.FromEntity:=FromEntity; ToEntity.FromDependency:= 
    ToEntity->FromDependency->union 
    (FromEntity.FromDependency)->including(E1) 
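
The condition/action structure of Rule 1 can be illustrated with a minimal Python sketch. This is not the extended OCL implementation used in the prototype repository. The names Entity, Dependency, SubtypeRel and rule_1 are illustrative assumptions loosely modelled on the meta-model. The check on upward ("U") patterns is omitted, and the test that every supertype link of the supertype has already been transformed is one possible reading of the first clause of the condition.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Entity:
    name: str
    no_table: bool = False
    # Dependencies arriving at this entity (it is one of the dependants).
    from_dependency: List["Dependency"] = field(default_factory=list)


@dataclass(eq=False)
class Dependency:
    from_entity: Entity
    type: str
    to_entities: List[Entity] = field(default_factory=list)


@dataclass(eq=False)
class SubtypeRel:
    from_entity: Entity                 # supertype
    to_entity: Entity                   # subtype
    pattern: str                        # "D" = downward inheritance
    defines_dependency: Optional[Dependency] = None


def rule_1(rel: SubtypeRel, all_rels: List[SubtypeRel]) -> bool:
    """Apply the rule to one subtype link; return True if it fired."""
    # Condition: correct pattern, not yet transformed, and all links
    # above the supertype have been transformed already.
    if rel.pattern != "D" or rel.defines_dependency is not None:
        return False
    if any(r.defines_dependency is None
           for r in all_rels if r.to_entity is rel.from_entity):
        return False
    # Action: no table for the supertype, one new dependency, and the
    # subtype inherits every dependency of the supertype.
    dep = Dependency(rel.from_entity, "D")
    rel.from_entity.no_table = True
    rel.defines_dependency = dep
    for d in rel.from_entity.from_dependency + [dep]:
        if d not in rel.to_entity.from_dependency:
            rel.to_entity.from_dependency.append(d)
            d.to_entities.append(rel.to_entity)
    return True


vehicle, auto, van = Entity("vehicle"), Entity("auto"), Entity("van")
rels = [SubtypeRel(auto, van, "D"), SubtypeRel(vehicle, auto, "D")]
while any(rule_1(r, rels) for r in rels):   # repeat until no link fires
    pass
vehicle_dependants = [e.name for e in rels[1].defines_dependency.to_entities]
```

Although the link from auto to van is listed first, it only fires after the link from vehicle to auto, which mirrors the ordering that the condition enforces. In this sketch the dependency created for vehicle ends up pointing at both auto and van.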



Rule 2: Transforms an entity definition into a table definition. The condition for this 
rule ensures that only the entities that should have a corresponding table will be 
transformed. The action statement generates the table and connects the table with the 
corresponding entity. 

Class: EREntity 
Condition: NoTable=false and ImplementsTable->isempty 
Declaration: RelTable T1 
Action: T1.create; T1.Name:=self.Name; 
    T1.ImplementsEntity:=T1.ImplementsEntity 
    ->including(self) 

Rule 3: Performs the transformation of attributes in an entity that has a corresponding 
table connected to it. The condition ensures that the entity the attribute is contained in 
has a corresponding table and that the attribute has not already been generated. The 
action section creates a corresponding relational attribute and connects that attribute 
to the correct table. 

Class: ERAttribute 
Condition: ParentEntity.NoTable=false and 
    ImplementsAttr->isempty 
Declaration: RelAttribute A1 
Action: A1.create; A1.Name:=self.Name; 
    A1.KeyState:=self.KeyState; 
    A1.ImplementsAttr:=self; 
    A1.ParentTable:=ParentEntity.ImplementsTable 
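
As an illustration of how Rules 2 and 3 interact, the following Python sketch treats each rule as a function that tests its condition and, if it holds, performs its action. The rules are applied in order until none of them fires. All class and function names are assumptions made for this sketch, and the extra guard in rule_3 (waiting until the table exists) stands in for the rule ordering that the paper relies on.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class RelAttribute:
    name: str
    key_state: bool


@dataclass(eq=False)
class RelTable:
    name: str
    child_attributes: List[RelAttribute] = field(default_factory=list)


@dataclass(eq=False)
class EREntity:
    name: str
    no_table: bool = False
    implements_table: Optional[RelTable] = None


@dataclass(eq=False)
class ERAttribute:
    name: str
    key_state: bool
    parent_entity: EREntity
    implements_attr: Optional[RelAttribute] = None


def rule_2(obj) -> bool:
    """Entity -> table, if the entity should have one and has none yet."""
    if not isinstance(obj, EREntity):
        return False
    if obj.no_table or obj.implements_table is not None:
        return False
    obj.implements_table = RelTable(obj.name)
    return True


def rule_3(obj) -> bool:
    """Attribute -> column in the table of its parent entity."""
    if not isinstance(obj, ERAttribute) or obj.implements_attr is not None:
        return False
    table = obj.parent_entity.implements_table
    # Extra guard (assumption): wait until Rule 2 has created the table.
    if obj.parent_entity.no_table or table is None:
        return False
    obj.implements_attr = RelAttribute(obj.name, obj.key_state)
    table.child_attributes.append(obj.implements_attr)
    return True


def transform(objects, rules=(rule_2, rule_3)) -> None:
    """Apply the rules in order, repeating until no rule fires."""
    fired = True
    while fired:
        fired = False
        for rule in rules:
            for obj in objects:
                fired = rule(obj) or fired


motorpool = EREntity("motorpool")
model = [ERAttribute("name", True, motorpool),
         ERAttribute("location", False, motorpool),
         motorpool]
transform(model)
columns = [(a.name, a.key_state)
           for a in motorpool.implements_table.child_attributes]
```

Because each condition also checks that the object has not been transformed before, a second pass over the model changes nothing and the loop terminates.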

Rule 4: Performs the slightly more complex task of creating the attributes that are 
connected to entities that have no directly corresponding table. The condition and 
action for this rule are basically the same as in the third rule. The difference is that the 
entity may have several dependants. Iteration is thus necessary so that the attributes 
are added to all of the dependants. 

Class: ERAttribute 
Condition: ParentEntity.NoTable=true and 
    ImplementsAttr->isempty 
Declaration: RelAttribute A1 
Action: ParentEntity.ToDependency->iterate(DP | 
    DP.ToEntity->reject(NoTable=true)->iterate(E1 | 
    A1.create; A1.Name:=self.Name; 
    A1.ImplementsAttr:=self; 
    A1.ParentTable:=E1.ImplementsTable)) 
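
The nested iteration of Rule 4 can be sketched in Python as follows. The sketch flattens the ToDependency/ToEntity navigation into a single list of dependants per entity, and represents tables as plain lists of column names. Both are simplifying assumptions, as are the attribute names other than colour.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(eq=False)
class Entity:
    name: str
    no_table: bool = False
    attributes: List[str] = field(default_factory=list)
    # All entities reachable through this entity's dependencies.
    dependants: List["Entity"] = field(default_factory=list)


def rule_4(entity: Entity, tables: Dict[str, List[str]]) -> None:
    """Copy the attributes of a table-less entity to its dependants."""
    if not entity.no_table:
        return                       # Rule 3 handles this case instead
    for attr in entity.attributes:
        for dep in entity.dependants:
            if dep.no_table:
                continue             # corresponds to reject(NoTable=true)
            tables[dep.name].append(attr)


motorcycle = Entity("motorcycle")
van = Entity("van")
patrolcar = Entity("patrolcar")
auto = Entity("auto", no_table=True, attributes=["seats"],
              dependants=[van, patrolcar])
vehicle = Entity("vehicle", no_table=True, attributes=["number", "colour"],
                 dependants=[auto, motorcycle, van, patrolcar])

tables: Dict[str, List[str]] = {"motorcycle": [], "van": [], "patrolcar": []}
for e in (vehicle, auto):
    rule_4(e, tables)
```

Each attribute of vehicle is created once per dependant that has a table, which is why the same column appears in several of the resulting tables.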

Rule 5: Performs the task of creating a dependency that specifies that the key in the 
corresponding table is dependent on the key in some other table. The condition 
guarantees that only relationships of the correct type, which have not previously been 
transformed, are transformed. Since different actions are performed depending on 
which of the connected entities have corresponding tables and which do not, this 
condition also specifies the correct condition to guarantee that the correct action is 
performed. The action then creates one key dependency for each of the dependants of 
the affected entities. 

Class: ERRelationship 
Condition: Type="I" and FromCardinality="1" and 
    ToCardinality="N" and ImplDep->isempty and 
    FromEntity.NoTable=true and ToEntity.NoTable=false 
Declaration: PrimaryKeyDep PK1 
Action: FromEntity.ToDependency->iterate(DP | 
    DP.ToEntity->reject(NoTable=true)->iterate(E1 | 
    PK1.create; PK1.ImplementsRel:=self; 
    PK1.ToTable:=E1.ImplementsTable; 
    PK1.FromTable:=ToEntity.ImplementsTable)) 

Rule 6: Functionally very similar to the fifth rule, with the difference that the rule 
only affects relationships that have many-to-many cardinality. The action is more 
complex since one table and two dependencies have to be created. 

Class: ERRelationship 
Condition: FromCardinality="N" and ToCardinality="N" 
    and ImplementsFK->isempty and FromEntity.NoTable=false 
    and ToEntity.NoTable=true 
Declaration: ForeignKey FK1, RelTable R1, ForeignKey FK2 
Action: FromEntity.ToDependency->iterate(DP | 
    DP.ToEntity->reject(NoTable=true)->iterate(E1 | 
    R1.create; R1.Name:=FromEntity.Name.concat 
    (ERRelationship.Name).concat(E1.Name); 
    self.ImplementsTable:=R1; FK1.create; 
    FK1.ImplementsRel:=self; FK1.ToTable:= 
    FromEntity.ImplementsTable; FK1.FromTable:=R1; 
    FK2.create; FK2.ImplementsRel:=self; 
    FK2.ToTable:=E1.ImplementsTable; FK2.FromTable:=R1)) 

Rule 7: Performs a cascade of primary keys, required by primary key dependencies. 
The condition guarantees that the same primary key dependency will not be 
transformed several times and that the source table of the dependency does not have 
any dependency that is not transformed. The action statement iterates through the 
keys and creates a foreign key between the tables. 

Class: PrimaryKeyDep 
Condition: FromTable.FromPKDep->forall(ImplementsAttr->notempty) 
Declaration: RelAttribute A1, ForeignKey FK1 
Action: FK1.create; FK1.FromTable:=FromTable; 
    FK1.ToTable:=ToTable; FromTable.ChildAttribute-> 
    reject(KeyState=false)->iterate(RA1 | A1.create; 
    A1.ImplementedbyFK:=FK1; A1.Name:=RA1.Name; 
    A1.KeyState:=RA1.KeyState; A1.ImplementedbyDep:=self; 
    A1.ParentTable:=ToTable) 

Rule 8: Creates foreign keys and the attributes needed to represent each foreign key. 
The condition guarantees that the foreign key dependency has not already been 
generated. The action iterates through all of the key attributes in the source table and 
creates corresponding attributes in the destination table. 

Class: ForeignKey 
Condition: ImplementedbyFK->notempty 
Declaration: RelAttribute A1 
Action: FromTable.ChildAttribute-> 
    reject(KeyState=false)->iterate(RA1 | A1.create; 
    A1.Name:=RA1.Name; A1.KeyState:=RA1.KeyState; 
    A1.ImplementedbyFK:=self; A1.ParentTable:=ToTable) 




Fig. 3. The example ER diagram 



Example Schema 

The example schema consists of an inheritance hierarchy that represents a number of 
different categories of vehicle used by a fictitious police department, described using 
the information engineering enhanced entity relationship notation (reference here!). 
The highest-level entity (vehicle) contains the attributes that are common to all of the 
different types of vehicles. The vehicle entity has two subtypes, motorcycle and auto. 
The motorcycle entity has an attribute which describes the type of motorcycle. All of 
the different types of auto have an attribute which lists the number of police officers 
that can be seated. Van has an attribute that describes the number of people other than 
police officers that can be seated in the vehicle. PatrolCar has an attribute that 
describes which type of marking the car has. The two final types of patrol car are the 
cruiser, which may have the capability to record video from the dashboard, and jeep, 
which may have extensive off-road capabilities. The vehicles do not, however, have 
any attribute that is unique for each vehicle. Each vehicle is identified by the number 
of the vehicle together with the name of the motor pool of which it is a part. There is 
thus an identifying relationship going from the motorpool entity to the vehicle entity. 
A motor pool has two attributes, the name of the motor pool and an attribute 
describing the location of the motor pool. Each of the autos may have a set of log 
entries corresponding to it, containing information about some event pertaining to that 
vehicle. Each log entry may be connected to several vehicles, and one vehicle may be 
connected to several log entries. Each log entry is identified by its unique 
identification number, and each log entry also contains a description of the log entry. 
In order to demonstrate different types of transformation patterns it has been decided 
that the vehicle and auto types will use a downward inheritance transformation, and 
the patrol car types will use upward inheritance. 

The resulting schema (Figure 4) consists of seven tables and a number of columns. 
The tables are represented by rectangles containing the name of the table in the top 
partition of the rectangle. Primary key columns are underlined, and foreign keys are 
represented by arrows connecting the various tables. 

The recursive inheritance hierarchies have successfully been transformed so that, 
for instance, the colour attribute is contained in several of the different tables. The 
transformation rules have also appropriately created multiple instances of the 
identifying relationships so that patrolcar, van and motorcycle each has the name 
attribute as part of its primary key. Instances of the many-to-many relationship have 
also been created, to both the patrolcar and van tables, while retaining the composite 
primary keys of those tables. 




(The arrow symbol denotes a foreign key dependency.) 



Fig. 4. The resulting relational tables 



6 Conclusions and Future Work 



In this paper we have presented an extension of the standardised declarative language 
OCL which, together with an extended meta-model also presented, allows for fuller 
interchange of models between CASE tools. In particular, it caters for the exchange of 
transformation information in multi-model situations. In such situations, 
transformations express constraints between models based on the systematic 
transformation of one into the other. Such transformations are not only made 
transparent, but expressed in OCL in such a way as to allow co-operative design using 
heterogeneous tools. Hence, iterative design can be achieved with repeated exchange 
backwards and forwards between tools. 

From the work presented it can be concluded that complex transformations that are 
present in modern CASE tools can be achieved using a modest extension of OCL. 
OCL is already used to express constraints in UML models, and can be used to 
express basic transformation constraints without extension. The extended language 
can be used to represent the actions used to transform a model, thus supporting 
iterative design in a context of model interchange. The transformations are 
represented in a UML-based meta-model, allowing for arbitrary extension of the 
transformation options available. Through the use of UML, transformations can easily 
be interchanged between different tools or tool sets using the standardised interchange 
language, XMI, which provides mechanisms to allow UML-compliant models to be 
interchanged using the XML language. 

The example rules used in this paper have all been entered into the prototype 
repository and used for the transformations shown. However, there is not yet user 
support for the extension. Whilst every effort has been made to represent these rules 
accurately in OCL, this does present a margin for error. An early goal is to provide 
user support for the implementation to save the many hours of work currently 
necessary to construct each test. 

Even though the example meta-model presented in this paper contains many 
advanced concepts, many more advanced extended ER models exist. For example, the 
current meta-model does not support models with n-ary relationships and attributed 
relationships without modification. A further step towards verification of the approach 
would be to extend the model to support more of these features. Another step would 
be to examine whether the approach can be utilised for other types of models, such as 
process models, user interface models and object-oriented models. 

The OCL extension has been implemented for a single environment only. To verify 
that the extension can easily be applied to other implementations of OCL interpreters, 
it could be ported to other environments and languages. 

The XMI meta-data interchange standard provides a set of rules to facilitate the 
mapping of UML models to XML documents. The task of actually creating XML 
documents from the UML models created for this paper could be considered trivial, 
but should be undertaken. 

In order to create an even more powerful interchange environment, work is 
underway to create a complete repository environment with support for the extended 
OCL language, allowing several different tools to share the same set of 
transformations and model constraints at the storage level. 


Abstract. Object-oriented programming languages (OOPLs like C++, Java, etc.) 
have established themselves in the development of complex software systems for 
more than a decade. With the integration of object-oriented concepts, object- 
relational database management systems (ORDBMSs) aim at supporting new 
generation software systems better and more efficiently. Facing the situation that 
nowadays more and more software development teams use OOPLs ‘on top of’ 
(O)RDBMSs, i. e., access (object-)relational databases from applications 
developed in OOPLs, this paper reports on our investigations on assessing the 
contribution of object-relational database technology to object-oriented software 
development. First, a conceptual examination shows that there is still a 
considerable gap between the object-relational paradigm (as represented by the 
SQL:1999 standard) and the object-oriented paradigm. Second, empirical studies 
(performed by using our new benchmark approach) point at mechanisms which 
are not part of SQL:1999 but would allow the mentioned gap to be reduced. Thus, we 
encourage the integration of such mechanisms, e. g., support for navigation and 
complex objects (structured query results), into ORDBMSs in order to be really 
beneficial for new generation software systems. 



1 Motivation 

Object-oriented programming languages (OOPLs), such as C++, Java, SmallTalk, etc., 
have established themselves in the development of complex software systems for more 
than a decade. Both the structure of these systems and the structure of the objects 
managed by these systems have become very complex. Object-oriented concepts 
offered by OOPLs are well suited for managing complex structured objects. However, 
there are additional requirements, such as persistence and transaction-protected 
manipulation, which can only be fulfilled efficiently by integrating a database 
management system (DBMS). Consequently, database technology becomes one of the 
core technologies of modern software systems. In times of the ‘breakthrough’ of 
object-oriented system development, two kinds of DBMSs were of practical 
relevance: relational DBMSs (RDBMSs) and object-oriented DBMSs (OODBMSs). 
Using OODBMSs has proved inefficient/inflexible for reasons we cannot go into in 
this paper. Consequently, OODBMSs did not gain wide acceptance [5] and, therefore, 
will not be further considered in this paper. 

B. Read (Ed.): BNCOD 2001, LNCS 2097, pp. 89-104, 2001. 

Using an RDBMS, on the one hand, requires overcoming the well-known impedance 
mismatch [5, 13], i. e., performing the non-trivial task of mapping complex object 
structures and navigational data processing (at the OOPL layer) to the set-oriented, 
descriptive query language (SQL92), which supports just a simple, flat data model. 
Despite this considerable mapping overhead, mature RDBMS technology (index 
structures, optimization, integrity control, etc.), on the other hand, helps to keep 
the overall system performance acceptable. Several commercial systems [2, 12, 14, 
15, 17, 19] mapping object-oriented structures onto the relational data model are 
currently available. Such systems are often referred to as Persistent Object System 
built on Relation (POS for short). 

The object-relational wave [22] in database technology has decisively reduced the 
gap between RDBMSs and OOPLs. Although object-relational DBMSs (ORDBMSs) 
are able to (internally) manage object-oriented structures (see data definition part in 
[20, 21]), the required seamless coupling of OOPLs and ORDBMSs is not yet 
possible, because (as in SQL92) results of SQL:1999 queries (see data manipulation 
part in [20, 21]) are (sets of data) tuples rather than (desired sets of) objects. In 
summary, the gap between OOPLs and ORDBMSs can be traced back to a whole 
range of modeling and operational aspects, as we will detail in the following sections. 
Furthermore, the SQL:1999 standard and the commercially available ORDBMSs 
differ very much in their object-oriented features. Thus, it is by no means clear how a 
given object-oriented design can be mapped to a given ORDBMS (most) efficiently, 
or which features should be offered by ORDBMSs in general in order to enable an 
efficient mapping of object-oriented structures. 

Our long-term objective is to influence the further development of ORDBMSs 
towards a better support of object-oriented software development (minimal mapping 
overhead). Thus, we have proposed a new benchmark approach in [26] that allows a 
given ORDBMS to be assessed by taking into account both its own performance and the 
required mapping overhead. Furthermore, [26] presents basic comparisons of purely 
relational and object-relational mappings. This paper goes beyond [26] in that its 
focus is to point out new directions of ORDBMS development which, as is proved by 
corresponding empirical examinations, object-oriented software development can 
benefit from and which, therefore, should be further pursued. This paper is structured 
as follows. A conceptual examination (section 2) outlines how the object-relational 
data model (standardized by SQL:1999) corresponds to OOPLs. Section 3 discusses 
corresponding mapping rules. Afterwards, section 4 gives a brief introduction to our 
benchmark approach, needed to interpret the measurement results detailed in section 5. 
These results show that object-oriented software development can benefit from 
object-relational technology (in comparison to purely relational technology), but that 
further improvements can be reached by better support of navigational access 
(retrieving objects by object identifiers) and appropriate mechanisms for retrieving 
complex structured sets of objects. We propose to admit corresponding mechanisms 
for navigation and complex object support to future versions of the SQL:1999 standard 
as concluded in section 6. 




The Real Benefits of Object-Relational DB-Technology 






2 Conceptual Consideration 

There is a multiplicity of object data models, for example ODMG [9], UML [24], 
COM [2], C++ and Java. All these models support the basic concepts of the OO 
paradigm; however, there are certain differences. Independently of the modelling 
language used in OO software development (e. g., UML), SQL:1999 must be 
coupled with a concrete OOPL. In accordance with their overall relevance and 
conceptual vicinity, we concentrate on the object model of C++ and ODMG and 
compare it with the SQL:1999 standard [11, 20, 21]. 



2.1 Modelling Aspects 

Object Orientation in OOPL. The concept object represents the foundation of the 
OO paradigm. An object is the encapsulation of data representing a semantic unit 
w. r. t. its structures/values and its behaviour. It conforms to a particular class [16]. In 
fact, a class implements an object type (classification) which is characterized by a 
name as well as a set of attributes and methods. Each attribute conforms to a certain 
data type and is either single-valued or set-valued (collection types). Furthermore, a 
data type can be scalar (e. g., integer, boolean, string, etc.) or complex. In the latter 
case corresponding values can be references (association) or objects of other classes 
(aggregation) so that complex structures can be modelled. A class may implement 
methods (behaviour) which can be invoked in order to change the object’s state. 
Classes may be arranged within class hierarchies. A class inherits structures and 
behaviour from its superclasses (inheritance), but may refine these definitions 
(specialization). Due to space restrictions we do not give a more detailed description 
of the OO paradigm, but discuss how the OR data model conforms to OO concepts. 

Object Orientation in SQL:1999. While the relational data model (SQL2) did not 
support semantic modelling concepts sufficiently, in SQL:1999 the fundamental extension 
supporting object-orientation is the structured user-defined data type (UDT, [11]). 
UDTs, which can be considered as object types, can be treated in the same way as predefined 
data types (built-in data types). Consequently, similarly to the type system of 
OOPLs, the type system of SQL:1999 is extensible. UDTs may be complex structured 
and, therefore, may not only contain predefined data types but also set-valued 
attributes (collection types) and even other UDTs (aggregation) or references 
(associations). Obviously, UDTs are comparable to the classes of the OO paradigm. 
However, according to the SQL:1999 standard a UDT must be associated with a table. 
The notion of typed table, also referred to as object table, allows instances of a certain 
UDT to be managed persistently within a table. Each tuple of such a table 
represents an instance (object) of a particular UDT and is identified by a unique object 
identifier (OID) which can be system-generated or user-defined. Besides instantiable 
UDTs, SQL:1999 also supports non-instantiable UDTs, which conform to the notion 
of abstract classes in OOPLs. In addition, UDTs may have methods (behaviour) which 
are either system-generated or implemented by users. They may participate in type 
hierarchies, in which more specialized types (subtypes) inherit structure and behaviour 
from more general types (supertypes), but may specialize corresponding definitions. 
Thus, SQL:1999 supports polymorphism and substitutability; however, multiple 
inheritance is not supported. Due to the association of UDTs with tables (see above) 
SQL:1999 does not support encapsulation and, consequently, there is nothing like the 
degree of encapsulation known from OOPL (public, protected, private). 

2.2 Operational Aspects 

Beside the fundamental modelling aspects discussed so far, we also have to examine 
operational aspects in order to figure out the conceptual distance adequately. The fol- 
lowing aspects are most relevant to our consideration: 

Descriptive Queries vs. Navigational Processing. While OOPL processing is inhe- 
rently navigational, SQL supports a set-oriented, descriptive query language. Both na- 
vigational and set-oriented query processing are important to modern software 
systems. Therefore, ORDBMSs should also directly facilitate navigational processing 
to fulfill this requirement of OO applications. Direct support of navigational access by 
the DBMS would mean that a database object referred to by its OID can be provided 
as an instance of an OOPL class. However, to the best of our knowledge none of the 
currently available ORDBMSs directly supports this notion of navigation. 

Fig. 1: Bottleneck between OOPL and ORDBMS 

A naive coupling of OOPLs with descriptive SQL requires issuing one or several 
corresponding SQL queries (see Fig. 1) to the database in order to process a 
dereferencing operation, e. g., GetObject(Ref), and to retrieve the requested object 
from the database server. Such a processing scheme will surely lead to 
poor runtime behavior of the entire system, since the costs of transforming a 
navigational operation to SQL queries, of evaluating these queries in the DBS, and of 
the client/server communication can be very high. Obviously, the lack of direct 
support for navigational access in DB APIs badly impairs system efficiency. Thus, 
either direct support for navigation¹ must be provided, or efficient prefetching 
mechanisms that exploit set-oriented database access and thereby reduce the number 
of database roundtrips must be applied in order to effectively couple OOPLs with 
ORDBMSs. 

¹ In section 5 we will see that one of the commercially available ORDBMSs provides 
some basic means for direct access to objects by OIDs. Measurement results show that 
this is at least a step in the right direction. 
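
The difference between naive dereferencing and set-oriented prefetching can be illustrated with a small Python sketch using the standard sqlite3 module. The table part, the helper names and the statement counter are assumptions made for illustration. SQLite runs in-process, so the counter merely stands in for the client/server roundtrips discussed above and is not a measurement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE part (oid INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO part VALUES (?, ?)",
                 [(i, f"part-{i}") for i in range(1, 6)])

stats = {"queries": 0}   # counts statements as a proxy for roundtrips


def run(sql, params=()):
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()


def get_object(oid):
    """Naive dereferencing: one query per reference."""
    rows = run("SELECT oid, name FROM part WHERE oid = ?", (oid,))
    return rows[0] if rows else None


def prefetch(oids):
    """Set-oriented prefetching: one query for a whole set of references."""
    marks = ",".join("?" for _ in oids)
    rows = run(f"SELECT oid, name FROM part WHERE oid IN ({marks})",
               tuple(oids))
    return {row[0]: row for row in rows}


refs = [1, 2, 3, 4, 5]

naive = [get_object(r) for r in refs]
naive_queries = stats["queries"]

stats["queries"] = 0
cache = prefetch(refs)
batched = [cache[r] for r in refs]   # later dereferences hit the cache
batched_queries = stats["queries"]
```

Both variants deliver the same tuples, but the naive variant issues one statement per reference, whereas the prefetching variant issues a single statement for the whole set.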

Structured Query Results. As already mentioned several times before, OOPLs 
support complex structured object types, especially by the possibility of nesting 
complex data types as well as using collection types and references. We have also 
mentioned previously that these facilities of modeling complex structured objects have 
been integrated into SQL by the SQL:1999 standard. Unfortunately, because of the 
traditional basic concepts of SQL, complex structures (actually supported both in 
OOPL and ORDBMS) get lost at the DBMS interface, since only (sets of) simply 
structured, flat data tuples can be retrieved. Therefore, if we want to couple an OOPL 
with an (O)RDBMS, it is necessary to separately retrieve simple fragments of 
complex objects by issuing several SQL queries, and then rebuild complex object 
structures at the programming language level (see Fig. 1). The mentioned problem 
even gets worse, if not individual complex objects, but complex structures (object 
graphs) containing numerous related objects interconnected by object references are to 
be selected as units. Obviously, the lack of direct support for complex structured 
objects at the DB API reveals a bottleneck between the two paradigms (see Fig. 1), 
and prevents new generation software systems from exploiting the potential power of 
ORDBMSs most effectively. 
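
The reassembly step described above can be sketched as follows, again in Python with sqlite3. The dept/emp schema and the class names are hypothetical. The sketch only shows that a complex object with a set-valued attribute has to be fetched as several flat result sets and rebuilt at the programming language level.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import List

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (oid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE emp  (oid INTEGER PRIMARY KEY, name TEXT,
                       dept_oid INTEGER REFERENCES dept(oid));
    INSERT INTO dept VALUES (1, 'R&D');
    INSERT INTO emp VALUES (10, 'Ann', 1);
    INSERT INTO emp VALUES (11, 'Bob', 1);
""")


@dataclass
class Employee:
    oid: int
    name: str


@dataclass
class Department:
    oid: int
    name: str
    employees: List[Employee] = field(default_factory=list)


def load_department(oid: int) -> Department:
    # Query 1: the flat tuple for the top-level object.
    row = conn.execute("SELECT oid, name FROM dept WHERE oid = ?",
                       (oid,)).fetchone()
    dept = Department(*row)
    # Query 2: the flat tuples for the set-valued attribute.
    for emp_row in conn.execute(
            "SELECT oid, name FROM emp WHERE dept_oid = ? ORDER BY oid",
            (oid,)):
        dept.employees.append(Employee(*emp_row))
    return dept


dept = load_department(1)
```

Every additional level of nesting or every further reference followed adds at least one more query of this kind, which is the overhead that structured query results would avoid.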

Object Behaviour. Of course, the operational aspects also encompass the object 
behaviour implemented in the database. Because of special implementation aspects 
these methods (UDFs) can almost exclusively be executed at the server side, or, if 
these UDFs or special client-invokable counterparts are executed at the client side, it 
cannot be guaranteed that these counterparts preserve the original semantics. For example, 
there may be complex dependencies between UDFs and integrity constraints, e. g., 
referential integrity constraints and triggers, which are implemented by using SQL and 
are automatically ensured by the DBMS. Thus, it is almost impossible to support 
calling UDFs at the OOPL level in the same (‘natural’) way as object methods can 
usually be called. Therefore, we do not consider a mapping of object methods in this 
paper and restrict our considerations to navigational and set-oriented access. 

3 Mapping Rules 

In the previous section, we outlined the conceptual distance between the OO paradigm 
and SQL:1999. Considering an individual ORDBMS, its OO features determine the 
overhead which has at least to be spent in order to bridge this distance. Nevertheless, 
in theory there is an entire spectrum of possibilities to design the required mapping 
layer. On the one hand, it depends on how ‘natural’ coding in the OOPL has to remain 
and, on the other hand, on how far the OO features are to be exploited. Regarding the 
first point (‘natural’ coding), we demand that the programmer must not be burdened 
by having to take data management aspects into account. Thus, programming must be 
independent of the database as well as the mapping layer design. Regarding the second 
point (degree of exploiting OO features), we want to outline the two extremes of the 
mentioned spectrum, i. e., pure relational mapping and full exploitation of the OO 
features offered by the considered ORDBMS³. 

² At this point, we want to mention that there are some more aspects of ORDBMSs 
which OO applications may benefit from, but which cannot be captured in this paper, 
e. g., facilities for integrating external data sources into database processing. 

Pure Relational Mapping. As mentioned before, there are several commercial POSs 
mapping OO structures to relational tables. Objects are represented by table rows. 
Since RDBMSs do not support set-valued attributes, user-defined data types, and 
object references, additional tables are required to store corresponding data and to 
connect them with the corresponding class tables via foreign keys [1]. Thus, several 
tables may be required to map a given class. In principle, there are several ways of 
representing a class hierarchy in the relational model, comprehensively discussed in 
[8]. After studying pros and cons, we decided to use the horizontal partitioning 
approach (see [8] for details), since it provides good performance in most cases, and is 
also used in most commercial POSs [2, 12]. 
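
As a rough illustration of horizontal partitioning, the Python sketch below generates one table per concrete class, each containing the inherited columns as well as the class's own. This follows the common reading of the term; the details in [8] may differ. The Person/Student/Employee hierarchy, the fields dictionaries and the helper functions are all assumptions made for this sketch.

```python
import sqlite3


class Person:                      # treated as abstract here: no table
    fields = {"name": "TEXT"}


class Student(Person):
    fields = {"university": "TEXT"}


class Employee(Person):
    fields = {"salary": "INTEGER"}


def all_fields(cls):
    """Collect columns from the root of the hierarchy down to cls."""
    merged = {}
    for klass in reversed(cls.__mro__):
        merged.update(klass.__dict__.get("fields", {}))
    return merged


def create_table_sql(cls):
    cols = ", ".join(f"{n} {t}" for n, t in all_fields(cls).items())
    return (f"CREATE TABLE {cls.__name__.lower()} "
            f"(oid INTEGER PRIMARY KEY, {cols})")


conn = sqlite3.connect(":memory:")
for cls in (Student, Employee):    # one table per concrete class
    conn.execute(create_table_sql(cls))

conn.execute("INSERT INTO employee (name, salary) VALUES ('Ann', 5000)")
row = conn.execute("SELECT name, salary FROM employee").fetchone()
student_columns = list(all_fields(Student))
```

An object of a concrete class is then stored in and read from exactly one table, without joins along the class hierarchy. The trade-off is that a query over all persons has to visit every subclass table.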

Object-Relational Mapping. Exploiting the OO features of ORDBMSs is commonly 
argued to be more promising [3], but the real benefits in comparison to the pure 
relational mapping are not very well studied yet. This paper will give some 
performance evaluations later on. 

Before outlining general mapping rules exploiting OO features of ORDBMSs, we 
have to re-emphasize the following point. Our benchmark approach, which will be 
outlined in the subsequent section, assesses a certain ORDBMS by taking the required 
mapping overhead into account. In order to be fair, the mapping layer used throughout 
the measurements must be designed in an optimal way w.r.t. the capabilities of the 
ORDBMS considered. Therefore, the design of the mapping layer may differ between the 
ORDBMSs to be assessed. In the following, we just outline general mapping rules, 
which are based on the SQL:1999 concepts, in order to provide some basic 
understanding of how a mapping layer can be designed. 

3 Note that SQL:1999 and the commercially available ORDBMSs differ very much in their 
OO features, as we will see in the subsequent sections of this paper. 

A C++ class maps to a UDT in SQL:1999. Non-instantiable UDTs correspond to 
abstract C++ classes. A UDT is associated with exactly one table (typed table) to 
hold its instances. Each tuple in this table represents a persistent instance (object) 
of a particular class and is associated with a system-generated OID. Embedded objects 
(aggregation) entirely belong to their top-level object and, therefore, do not own an 
OID. Extents are mapped to the list constructor of the C++ STL. Keys are managed at the 
mapping layer by applying the map constructor of the C++ STL. A C++ class hierarchy 
maps to a hierarchy of structured UDTs. However, SQL:1999 only supports single 
inheritance, so that multiple inheritance has to be simulated at the mapping layer. SQL 
references are mapped to C++ pointers. Since SQL:1999 does not support bidirectional 
references, a relationship type is broken down into two separate primary-key/foreign-key 
connections and the mapping layer maintains the referential integrity. Except for 
mutator and observer methods, which are generated by the mapping layer w.r.t. the 
constraints defined in the user database schema, object behaviour is not yet considered 
in our performance investigation. Navigation is supported by offering the function 
GetObject(Ref) which, in the case that the DB API does not directly support OID-based 
object fetching, is implicitly transformed into an SQL query. 
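
The last rule, the implicit transformation of GetObject(Ref) into an SQL query, can be sketched as follows. This is only an illustration under stated assumptions: the Ref structure, the table person, the column obj_id, and the function get_object are hypothetical, and SQLite stands in for an ORDBMS whose API offers no OID-based fetch.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Ref:
    """Hypothetical reference: name of the typed table plus the OID."""
    table: str
    oid: int

# Table names cannot be bound as SQL parameters, so they are checked
# against a whitelist of known tables before being placed in the query.
KNOWN_TABLES = {"person"}

def get_object(conn, ref):
    """Emulate GetObject(Ref) by issuing a SELECT on the OID column."""
    if ref.table not in KNOWN_TABLES:
        raise ValueError(f"unknown table: {ref.table}")
    cur = conn.execute(f"SELECT * FROM {ref.table} WHERE obj_id = ?", (ref.oid,))
    row = cur.fetchone()
    if row is None:
        return None
    columns = [d[0] for d in cur.description]
    return dict(zip(columns, row))   # flat row rebuilt into an object-like dict

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (obj_id INTEGER PRIMARY KEY, name TEXT, boss INTEGER)")
conn.executemany("INSERT INTO person VALUES (?, ?, ?)", [(1, "Ann", None), (2, "Bob", 1)])

bob = get_object(conn, Ref("person", 2))
boss = get_object(conn, Ref("person", bob["boss"]))   # one more query per navigation step
```

Each navigation step costs a full query, which is the overhead examined later in the measurements.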

After having discussed (modelling and operational) discrepancies between 
ORDBMSs and OOPLs as well as the mapping rules needed to bridge the gap, we 
proceed with our performance evaluations. 

4 Performance Evaluation 

Our discussion in section 2 shows that there is only a small difference between the OO 
and OR paradigms w. r. t. modeling aspects, but a considerable distance w. r. t. the 
operational aspects and the application semantics. In order to further evaluate this 
distance as well as to quantify the overhead required for bridging this gap, we propose 
a configurable benchmark approach [18, 26]. 

Recall that we do not consider OODBMSs, but ORDBMSs, because we increasingly 
face the situation that people use OOPLs for software development 
and (O)RDBMSs for data management, so that there is a need for a more 
detailed examination of the efficiency of possible coupling mechanisms. 
Consequently, the OO7 benchmark [5], an important standard for 
benchmarking OO systems, is not appropriate for our purposes. The performance of 
RDBMSs or ORDBMSs has traditionally been evaluated in isolation by applying a 
standard benchmark directly at the DBMS interface. Sample benchmarks [8] are the 
Wisconsin benchmark [3], the TPC benchmark [23] as well as the BUCKY benchmark 
[6]. These benchmarks are very suitable for comparing different DBMSs with each 
other [8]. However, none of these benchmarks helps to assess the contributions of a 
DBMS to OO software development. In particular, these approaches take into account 
neither the typical application server architecture nor the fact that the DBMS capabilities 
determine the overhead of the required application/mapping layer. 
Furthermore, data types as well as operations of the applications we consider may 
differ significantly (double-edged sword [6, 22]), so that a standard benchmark cannot 
cover the entire spectrum. Therefore, we propose an open, configurable benchmark 
system that allows the entire system (incl. the mapping layer) to be examined w.r.t. its 
typical applications. Such a system will also help us to get results transcending those 
reported in this paper (see succeeding sections), e.g., more detailed examinations 
of navigational support. In the following, we outline our first prototype. Further details 
can be found in [1, 26]. 

4.1 Benchmark System 

An open, configurable benchmark system is not necessarily difficult to apply, as 
our approach proves. Indeed, our current prototype offers 3 predefined configurations, 
which are small, medium and large w.r.t. database size, in order to be sufficiently 
scalable. Both the structures (data types and type hierarchies) and the complexity of 
the data in the 3 standard configurations were determined in cooperation with one of 
the leading vendors of business standard software and, thus, represent a wide 
spectrum of typical application domains. In addition, our benchmark 
system can easily be configured according to the particular properties of a concrete 
application. Fig. 2 gives an overview of the architecture of our prototype. Among 
other possibilities, users can directly control the generation process of the benchmark 
database, e.g., specify the database size, the complexity of the class hierarchy and the 
complexity of the individual objects, in order to ensure that the special requirements 
of the application in mind are taken into account. After having generated the 
benchmark database, the user may select from a given set of query templates (see 
section 4.2 for further details) and indicate how many times each template is to be 
instantiated. New query templates can easily be added if the existing templates do not 
reflect the application characteristics sufficiently. Based on these user 
selections/specifications, the load generator creates a set of queries, which is passed to 
the query executor, which, in turn, serves as a kind of driver for the measurements. Users 
can also specify which kinds of measurement data are to be collected by the system, 
i.e., the amount of time spent at the DB or at the mapping layer for query transformation, or 
the time spent for SQL query evaluation, data loading, and/or result set construction. 
The corresponding values are collected by the data collector during execution of the query 
set and afterwards stored in the DBS for further evaluation. 
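
The interplay of query templates, load generator, query executor and data collector described above can be sketched roughly as follows. The sketch does not reproduce the prototype: the template names, the table item, and the functions generate_load and execute_load are assumptions, and SQLite stands in for the DBMS under test.

```python
import random
import sqlite3
import time

# Hypothetical query templates; '{value}' is filled in by the load generator.
TEMPLATES = {
    "scalar_eq": "SELECT * FROM item WHERE price = {value}",
    "scalar_lt": "SELECT * FROM item WHERE price < {value}",
}

def generate_load(selection, rng):
    """Instantiate each selected template the requested number of times."""
    queries = []
    for name, count in selection.items():
        for _ in range(count):
            queries.append((name, TEMPLATES[name].format(value=rng.randint(1, 100))))
    return queries

def execute_load(conn, queries):
    """Run every query and collect one timing record per query."""
    records = []
    for name, sql in queries:
        start = time.perf_counter()
        rows = conn.execute(sql).fetchall()   # includes loading the result set
        elapsed = time.perf_counter() - start
        records.append({"template": name, "rows": len(rows), "db_time": elapsed})
    return records

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, price INTEGER)")
conn.executemany("INSERT INTO item (price) VALUES (?)", [(p,) for p in range(1, 101)])

rng = random.Random(42)   # fixed seed so that the generated load is repeatable
load = generate_load({"scalar_eq": 3, "scalar_lt": 2}, rng)
measurements = execute_load(conn, load)
```

Adding a new query type amounts to adding one entry to the template table, which mirrors the extensibility claimed for the benchmark system.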

As explained in more detail in [1, 26], the special challenges of this benchmark 
approach are, on the one hand, to properly take into account the requirements of OO system 
development and, on the other hand, to guarantee an optimal mapping w.r.t. the 
particular capabilities of an individual ORDBMS. 

4.2 Benchmark and Measurements 

In order to get a complete performance evaluation, we concentrate on answering the 
following questions: 

1. Which performance gains do ORDBMSs offer in comparison to RDBMSs regarding 
their use in OO software development? 

2. Which additional overhead has to be spent at the mapping layer in order to bridge 
the gap between the OO and OR paradigms, and how does it behave for 
different query types? 

3. To what extent is the system performance influenced by the capabilities of the 
(O)RDBMS API? 

In order to be able to answer the first two questions, we have selected a set of typical 
benchmarking queries according to a long-term study of a leading software company. 
These queries represent a wide spectrum of typical operations in the target 
applications of ORDBMSs. We have compared a purely relational mapping with an 
object-relational one (which exploits the OO modelling power of the system) by using a 
currently available commercial ORDBMS. This way we ‘measured’ how OO software 
development can benefit from the OO extensions offered by ORDBMSs (e.g., 
structured UDTs, references, etc.). The operations considered for that purpose are 
implemented as query templates and grouped into the following categories: 

Navigation operations: Navigation operations, such as GetObject(OID), are not 
directly supported by almost any currently available ORDBMS. Considering such 
operations helps us to assess the performance of ORDBMSs in supporting 
navigational processing. We hope that the corresponding results ‘help’ ORDBMS vendors 
to make ORDBMSs as efficient as OODBMSs are in this respect. 

Queries with simple predicates on scalar attributes: Queries of this category have 
simple predicates containing just a single comparison operation on a scalar attribute. 
This group mainly serves to provide a performance baseline that can be helpful when 
interpreting the results of more complex queries. 

Queries with predicates on UDTs: This group contains queries with simple 
predicates (a single comparison operation) on attributes of structured, non-atomic data 
types. Thus, it mainly serves for assessing the efficiency of mapping UDTs to 
ORDBMSs. 

Queries with predicates on set-valued attributes: This group contains queries with 
simple predicates (a single IN operation) on nested sets. ORDBMSs directly support 
set-valued data types. In the relational mapping, several tables (according to the 
degree of nesting), which are connected by primary/foreign keys, are necessary. 

Queries with path predicates: This group contains queries evaluating simple 
predicates after path traversals. These queries allow us to evaluate the efficiency of 
processing dereferencing operations (path traversals) in ORDBMSs. 

Queries with complex predicates: Queries of this group contain complex predicates 
challenging both query transformation and query optimization. 

Queries on the class hierarchy: While all other queries exclusively deliver direct 
instances of a single queried class, queries of this group deliver transitive instances as 
well. Predicates conform to those of the second category. This group of queries allows 
us to evaluate the efficiency of the ORDBMS in handling class hierarchies (inheritance). 
The comparison with the relational mapping was expected to demonstrate the 
advantages of ORDBMSs. 
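
To make the last category concrete, the following sketch shows what a query on the class hierarchy may look like in a purely relational mapping with horizontal partitioning, where every class has its own table that repeats the inherited columns. The hierarchy Employee/Manager, the table layout, and the helper names are assumptions chosen for illustration; SQLite stands in for the RDBMS.

```python
import sqlite3

# Hypothetical hierarchy: Manager is a subclass of Employee. Each class has
# its own table that repeats the inherited columns (horizontal partitioning).
SUBCLASSES = {"employee": ["manager"], "manager": []}

def class_closure(cls):
    """Return cls followed by all of its transitive subclasses."""
    result = [cls]
    for sub in SUBCLASSES[cls]:
        result.extend(class_closure(sub))
    return result

def query_hierarchy(conn, cls, min_salary):
    """Evaluate a simple scalar predicate over a class and all its subclasses."""
    parts = [
        f"SELECT id, name, salary, '{t}' AS cls FROM {t} WHERE salary >= ?"
        for t in class_closure(cls)
    ]
    sql = " UNION ALL ".join(parts) + " ORDER BY id"
    return conn.execute(sql, [min_salary] * len(parts)).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER);
CREATE TABLE manager  (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER, budget INTEGER);
INSERT INTO employee VALUES (1, 'Ann', 50), (2, 'Bob', 30);
INSERT INTO manager  VALUES (3, 'Cy', 80, 1000);
""")

rows = query_hierarchy(conn, "employee", 40)
```

The mapping layer has to know the subclass closure and generate one branch per table, whereas an ORDBMS with table hierarchies can answer such a query against a single supertable.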

The third question posed at the beginning of this section deals with the capabilities of 
the DB interface, especially w.r.t. support for complex structured objects and 
navigational access. In order to examine these aspects, we performed measurements 
on two different (commercially successful) ORDBMSs. One of these systems offers 
the more traditional interface, whereas the second one provides some basic means of 
supporting complex structured objects. 

We performed our measurements⁴ on a benchmarking database with 100 classes and 
250,000 instances (configuration medium). In order to use a representatively structured 
class hierarchy, we studied typical application scenarios of a renowned vendor of 
business standard software and parameterized our population algorithm accordingly. 
We measured the database time (DB time) and the total system time (TS time). The 
DB time of SQL queries is the time elapsed between delegating the queries to the 
DBMS and receiving back the results (open cursors, traverse iterators). It includes the 
time for client/server communication, the time for evaluating the queries within the 
DBS and the time for loading the complete result sets. This has to be taken into 
account when analysing the measurement results. The TS time is defined as the total 
elapsed time from issuing a query operation at the OOPL level until having received 
the complete result set. It contains the time spent within the mapping layer as well as 
the DB time. 
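
The relation between the two measures (TS time = DB time + time spent in the mapping layer) can be sketched as a small timing wrapper. The class Timings and the function run_query are hypothetical names; fetch_rows stands for the DBMS call including result loading, and build_object for the reconstruction work of the mapping layer.

```python
import time

class Timings:
    """Accumulates DB time and total system (TS) time over several queries."""

    def __init__(self):
        self.db_time = 0.0
        self.ts_time = 0.0

    @property
    def mapping_time(self):
        # Time spent in the mapping layer is what remains of TS time.
        return self.ts_time - self.db_time

    @property
    def mapping_share(self):
        return self.mapping_time / self.ts_time if self.ts_time else 0.0

def run_query(fetch_rows, build_object, timings):
    """Time one OOPL-level query: the DB part and the whole operation."""
    ts_start = time.perf_counter()
    db_start = time.perf_counter()
    rows = list(fetch_rows())                   # DBMS call incl. result loading
    timings.db_time += time.perf_counter() - db_start
    objects = [build_object(r) for r in rows]   # mapping-layer work
    timings.ts_time += time.perf_counter() - ts_start
    return objects

timings = Timings()
rows = [(1, "Ann"), (2, "Bob")]
people = run_query(lambda: iter(rows), lambda r: {"id": r[0], "name": r[1]}, timings)
```

The derived value mapping_share is the fraction of the total system time that is spent outside the DBMS.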

We think that these 3 questions have to be answered before we can think about how 
OR technology can be improved in order to support OOPLs better and more 
efficiently. In the following section, we report on our measurement results. 

5 Measurement Results and Observations 

5.1 ORDBMS vs. RDBMS 

In the first test series, we have compared a purely relational mapping with an OR 
mapping by using one of the leading currently available commercial ORDBMSs. This 
investigation aims at quantifying the benefits of the OO extensions offered by ORDBMSs 
in more detail. 

Fig. 3 illustrates the measurement results. Due to space restrictions, it is not possible 
to analyze all results in detail. It can be observed that the OR mapping outperforms the 
purely relational mapping in all query categories. 



4 All experiments in this paper use commodity software. The hardware and the software 
configurations are left unspecified, to avoid the usual legal and competitive problems 
with publishing performance numbers for commercial products. All performance 
measurements are averages of multiple trials, with more trials for higher variance 
measurements. For each DBMS tested, we put much effort in optimization (e.g., 
indexes) and mapping layer design in order to achieve the best performance possible. 
Although the OR mapping shows only tiny advantages in retrieving small result 
sets, it provides performance gains of up to 40% in retrieving large result sets or 
processing queries on class hierarchies. The reasons are twofold. First, the OO features 
provided by the ORDBMS help to keep the complexity of the mapping layer low 
(better query evaluation strategy, less overhead for synthesizing the result set) and to 
reduce client/server communication (less [...] 

[Fig. 3. Panel titles: "Simple queries on scalar attributes with different selectivity"; 
"Queries on user-defined data types (UDTs)"] 

[...] attributes, aggregations and (m:n)-relationships properly demands several tables 
interconnected by primary/foreign keys. This, in turn, requires posing several SQL 
queries in order to perform one query at the OOPL level, implying a higher DB time 
and higher communication costs. Compared to a semantically equivalent OR mapping, 
the more expensive query transformation and the necessary reconstruction of object 
structures in the pure relational mapping reduce the system efficiency additionally. 
This effect can even be reinforced if further OR features such as function-based 
indexing or index structures over class hierarchies, which can be tailored to OO 
applications, are used in order to further improve the OR mapping. 

Since in our measurements both client and server ran on the same machine, the 
costs of communication between the database server and the mapping layer are 
relatively small. It is to be expected that a distributed client/server architecture will 
considerably enlarge the difference between these measurement results. Furthermore, 
the more complex the data structures, the more additional overhead is to be expected 
in the relational mapping. As reported in former measurements [6, 25], some OR 
systems performed in some cases even worse than semantically equivalent relational 
systems. Based on our examinations, we think that this statement has to be revised: 
ORDBMSs have, in the meantime, obviously become more mature. 

5.2 Performance Characteristics of the Mapping Layer 

In order to characterize the performance of the mapping layer adequately, we have 
investigated simple queries with different selectivity. The results are presented in Fig. 
4. As already mentioned in section 2, the DB APIs of almost all currently available 
ORDBMSs do not support navigational access directly. Hence, a navigational 
operation, such as GetObject(Ref), must be transformed by the mapping layer into a 
database query (SQL:1999), such as "Select * From... Where OID = Ref". Such 
a query strategy, especially when dealing intensively with navigational operations as 
usually required by most OO applications, obviously leads to a high processing 
overhead spent in the database system as well as very high communication costs 
(over 70% of the entire system time; see Fig. 4, left-hand side). The (probably not 
very astonishing) observation is that the traditional query strategy is not adequate 
for supporting navigational access. 

As we can see on the right-hand side of Fig. 4, the DB time of set-oriented queries 
shows only a slight increase with growing result sets, while the additional mapping 
overhead increases rapidly. When retrieving 1250 objects, the time spent at the 
mapping layer even exceeds 86% of the total system time (see Fig. 4, right-hand side). 
This observation can be explained as follows. In the early days of ORDBMSs, these 
systems, like RDBMSs, were not very successful in supporting navigational 
access, but excellent at processing set-oriented access (as they still are today). 
Unfortunately, OO applications can hardly benefit from this advantage, because the 
ORDBMS API is ‘inherited’ from traditional RDBMSs and, therefore, still only 
supports simple, flat data. For lack of an extensible DB API which may generically 
support complex data types defined by the user, complex objects in an ORDBMS have 
to be first ‘disassembled’ into scalar values and afterwards reconstructed (at the 
mapping layer) into objects of a certain class in the particular OOPL. This kind of overhead 
becomes dramatic with increasing result set cardinality and impairs the entire system 
efficiency significantly. 

Regarding these measurement results, we can draw the following conclusions. In 
order to be able to support navigational access better, the ORDB API should directly 
support navigational operations like GetObject(Ref), so that the costs of transforming 
navigational operations into SQL queries and of evaluating these queries can be 
avoided. Furthermore, it should also support the notion of complex objects directly 
and offer the possibility of retrieving complex objects as units. According to our 
examinations, such improvements can increase the entire system efficiency by up to 
400%. 

5.3 Support for Complex Objects 

As already mentioned before, the lack of direct support for complex objects and 
navigational access at the DB API level severely impairs the overall system 
efficiency. Fortunately, a leading ORDBMS vendor already offers an extended call 
level interface, which, as we can see later in this section, directly supports navigational 
access as well as retrieval of complex objects as units, and, in addition, even retrieval of 
complex object graphs as units. 

Fig. 5: Measurement Results III 

Navigation is enabled by the possibility of autonomously retrieving complex 
structured objects (by OID) as instances of C structures. This simplifies the mapping 
to OOPLs, such as C++, considerably and, therefore, is undoubtedly the first step in 
the right direction, although this mechanism does not yet support the actually desired 
seamless coupling (transparent transformation from a database object to an instance of 
an OOPL class). The mentioned support for complex objects at the level of the DB 
API allows a complex object’s data to be retrieved directly from the database into main 
memory by specifying its OID or a predicate. Therefore, the expensive query 
processing strategy described in section 5.2 can be avoided. Recall that the 
measurements described in section 5.2 were performed on an ORDBMS that 
does not possess a DB API like the one described in this section. To show the 
importance of, and the corresponding demand for, suitable support for complex 
objects at the DB API level, we repeated the measurements described in section 5.2 on 
the ORDBMS referred to in this section, which provides the mentioned complex object 
support at its API. Fig. 5 illustrates the measurement results. Obviously, the additional 
overhead spent at the mapping layer is now independent of the cardinality of the 
query result sets. Thus, the direct support of complex objects at the DB API results in 
a clear performance gain (up to 400%). 

The direct support for navigational access at the DB API level mentioned in this 
section, which allows objects to be accessed directly by calling a function like 
GetObject(Ref), avoids expensive processes (query transformation, data type 
conversion and object reconstruction). This obviously improves performance 
significantly. Fig. 6a shows a comparison between a query strategy (transforming a 
navigational operation into an SQL query) and a navigational strategy (directly calling a 
GetObject(OID) function at the DB API). The advantages of the navigational strategy 
are obvious. With direct support of navigational access, the entire system efficiency 
increases by approximately 200%. 

OO applications often want a set of objects interconnected by object references 
(object graph) to be retrieved completely within just a single database interaction. Fig. 6b 
shows a comparison of two strategies for retrieving complex object graphs. In this 
measurement, we used the ORDBMS directly supporting navigation as well as 
retrieval of complex object graphs. It can be seen clearly that strategy I, which exploits the 
ability of retrieving object graphs, exhibits a performance gain of about 100% already 
at a result set cardinality of 13 objects. 
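
Why graph retrieval within a single interaction pays off can be illustrated by simply counting client/server interactions. The following sketch does not reproduce the measured system: FakeDB, get_object, get_graph and fetch_by_navigation are hypothetical names, and the 13-object graph is only chosen to match the cardinality mentioned above.

```python
class FakeDB:
    """In-memory stand-in for a DB API that counts client/server interactions."""

    def __init__(self, objects):
        self.objects = objects        # oid -> {"refs": [oid, ...]}
        self.round_trips = 0

    def get_object(self, oid):
        """One interaction per object (navigational access)."""
        self.round_trips += 1
        return self.objects[oid]

    def get_graph(self, root_oid):
        """Hypothetical call returning the whole reachable graph at once."""
        self.round_trips += 1
        seen, stack = set(), [root_oid]
        while stack:
            oid = stack.pop()
            if oid not in seen:
                seen.add(oid)
                stack.extend(self.objects[oid]["refs"])
        return {oid: self.objects[oid] for oid in seen}

def fetch_by_navigation(db, root_oid):
    """Client-side traversal: the graph is assembled object by object."""
    graph, stack = {}, [root_oid]
    while stack:
        oid = stack.pop()
        if oid not in graph:
            graph[oid] = db.get_object(oid)
            stack.extend(graph[oid]["refs"])
    return graph

# A root with 3 children, each of which has 3 children: 13 objects in total.
objects = {0: {"refs": [1, 2, 3]}}
for parent in (1, 2, 3):
    children = [3 * parent + 1, 3 * parent + 2, 3 * parent + 3]
    objects[parent] = {"refs": children}
    for child in children:
        objects[child] = {"refs": []}

db_nav, db_graph = FakeDB(objects), FakeDB(objects)
graph_nav = fetch_by_navigation(db_nav, 0)
graph_one = db_graph.get_graph(0)
```

Both strategies deliver the same graph, but the navigational variant needs one interaction per object while the graph variant needs a single one.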

The design of a new DB API which directly supports complex structured objects is 
by no means an easy job and requires generic design methods, because user-defined 
data types can be arbitrarily structured, e.g., contain other complex data types, such as 
UDTs, references and set-valued attributes. Furthermore, a DB API always has to be 
multi-lingual, i.e., it has to support all common programming languages simultaneously, 
which makes it very difficult to offer the best of both worlds (DBMSs and 
OOPLs) without any compromises. 

6 Conclusions and Outlook 

In this paper, we have emphasized the importance of assessing ORDBMSs w.r.t. 
their capabilities of supporting OOPLs. We have first qualitatively considered 
ORDBMSs and OOPLs regarding modelling and operational aspects relevant for 
object-oriented software development. Although the object-relational (SQL:1999) and the 
object-oriented data models are essentially converging, the operational distance 
between these two paradigms is still considerable, so that an additional mapping layer 
is necessary to bridge this gap. Regrettably, such an additional layer impairs the 
performance of the overall system considerably. Additionally, we performed 
quantitative examinations (measurements) in order to assess ORDBMSs in their 
capabilities of supporting OOPLs. Indirectly, these measurements are intended to 
promote the optimal utilization of currently available ORDBMSs in 
object-oriented system development and to guide the future development of 
ORDBMSs in such a way that the support for OOPLs is improved. 

Regarding our performance examinations, we have motivated the necessity of an 
open, configurable benchmark approach, because not only the performance of 
ORDBMSs themselves but also the additional overhead which is necessary for 
bridging the conceptual and operational distance between ORDBMSs and OOPLs 
has to be taken into account and, therefore, properly characterized. Our performance 
measurements have clearly illustrated that object-relational database 
technology is becoming more and more mature, not only conceptually (data model, query 
processing), but also w.r.t. performance. The model facilities help to keep the 
mapping layer ‘thin’ in contrast to RDBMSs. This, on the one hand, reduces the 
implementation effort and, on the other hand, increases the entire system efficiency. 
Despite the available object-oriented extensions, which entail an unambiguous gain in 
comparison to RDBMSs, the potential benefit of object-relational database technology 
is, in our opinion, not yet exhausted, since the traditional DB API is so far not capable 
of successfully supporting object-oriented principles. The DB API of almost all 
ORDBMSs still cannot support navigational operations and complex object structures 
directly, so that new-generation software systems cannot take advantage of 
object-relational database technology in an optimal way. The DB API can be called the 
‘bottleneck’ between the object-relational and object-oriented paradigms. Our examinations have 
shown clearly that a new DB API directly supporting navigational operations and 
complex objects is necessary. Obviously, it is not easy to equip ORDBMSs with such 
a new interface. For example, such an interface must be multi-lingual. Answering 
the question of how user-defined data types can be effectively represented at the OOPL 
level requires further research efforts. Altogether, the problem of a seamless and 
effective mapping of SQL:1999 to OOPLs, such as C++ and Java, has to be worked on 
further. Our future work will mainly look for possible solutions. Furthermore, 
we plan intensive studies of ORDBMS capabilities for supporting navigational access, 
because ORDBMSs are still behind OODBMSs in this respect. Generally, after 
having characterized the performance aspects in more detail, our long-term objective 
is to develop applicable concepts that help to increase the performance of 
ORDBMSs. 


Abstract. This paper is concerned with tracking the evolution of design 
component versions and their related design configuration versions in a 
concurrent engineering design environment. An important aspect is the capability 
to determine if a dynamically bound configuration version is consistent with 
its design goals and its assembly has no design conflicts between its components’ 
versions. We present a generalized object-oriented model which captures 
the evolution of design configurations and their components by supporting 
versioning at all levels. The dynamics and consistency of multiversion 
configurations are also addressed. 



1 Introduction 

A complex design artefact may be configured from a large number of components, 
each of which may evolve over time. A configuration is a set of component designs 
which combine to form a version of the design artefact. A configuration itself may 
evolve over time. This requires that a record of the multiple versions of a 
configuration be maintained to support the Concurrent Engineering (CE) 
design activities, which are performed concurrently rather than sequentially in an 
iterative and tentative fashion. Consequently, configuration versioning enables the 
designers to work concurrently on different versions of a configuration as well as to 
roll back to a previous stable state of the design configuration whenever changes 
introduce problems. The binding of a configuration version to its versioned 
component can be static, to a specific version of the component, or dynamic, to a default 
setting that can be resolved at run time. Dynamic evolution of a configuration is the 
process of allowing a configuration to evolve while at the same time its components 
are dynamically bound. Furthermore, the consistency of a dynamically bound 
configuration needs to be maintained. A consistent configuration is a configuration which is 
composed from a set of components that satisfies design constraints and can be 
grouped together without causing design conflicts. Hence, a multiversion 
configuration has to preserve the consistency in each version of the configuration. 
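
The difference between static and dynamic binding can be sketched in a few lines. This is not the model presented in this paper; the classes Component, Binding and ConfigurationVersion, as well as the component names Deck and Pier, are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Component:
    """A versioned component with a default version used for dynamic binding."""
    name: str
    versions: List[int] = field(default_factory=list)
    default: Optional[int] = None

    def add_version(self, make_default=True):
        vid = len(self.versions) + 1
        self.versions.append(vid)
        if make_default:
            self.default = vid
        return vid

@dataclass
class Binding:
    """Static if 'version' is set; dynamic (resolved at run time) otherwise."""
    component: Component
    version: Optional[int] = None

    def resolve(self):
        return self.version if self.version is not None else self.component.default

@dataclass
class ConfigurationVersion:
    bindings: List[Binding]

    def resolve(self):
        return {b.component.name: b.resolve() for b in self.bindings}

deck, pier = Component("Deck"), Component("Pier")
deck.add_version()
pier.add_version()

config = ConfigurationVersion([Binding(deck, version=1), Binding(pier)])
before = config.resolve()
deck.add_version()      # new default versions appear for both components
pier.add_version()
after = config.resolve()
```

After both components have evolved, the statically bound Deck stays at version 1, while the dynamically bound Pier follows its new default. This is exactly the situation in which consistency has to be rechecked.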

B. Read (Ed.): BNCOD 2001, LNCS 2097, pp. 105-125, 2001. 

Keeping track of multiple versions of a complex configuration composed of a large 
number of complex components, which are also versioned due to their own evolution, 
is a challenging task. This complexity increases in a multidisciplinary CE design 
environment where a large number of participants may be involved in the design 
process and the design is continually changing due to their interactions. 

In a CE design, the design activities are normally performed using design tools (i.e. 
CAD systems). Although these tools can manipulate the geometrical aspects of the 
design artefact, they lack the support of a powerful management system that can 
integrate and keep track of all the phases and states of a large and complex design 
artefact [15,16]. Therefore, a management tool is essential to keep track of the 
evolution and change in the design artefact and its components. 

The basic requirements of a CE design environment supporting configuration 
version management are the representation of the complex hierarchy of its design objects 
and their relationships, aggregation of primitive design objects to form a higher level 
design object, design concurrency support, version management, configuration 
management, and design consistency management. In this paper, we are concentrating 
on these requirements as they form the basis for a design environment supporting 
design configuration versioning. Conventional database systems, with their emphasis 
on record-based applications, are unable to satisfy these requirements of engineering 
applications. For this reason, advanced database systems that can handle such 
requirements are needed. Object-Oriented Database (OODB) systems are considered 
capable of satisfying most of these requirements since they possess rich modelling and 
manipulation features such as Generalization, Classification, Inheritance and 
Aggregation [1,8,9,10,12,14]. A very important semantic extension in OODB systems is 
version management. The notion of versions is a widely accepted mechanism for 
recording design evolution which enables design reuse as well as supporting 
concurrency [1,3,9,10,11,13,14,15,16,23]. Hence, CAD systems are used to 
manipulate the geometrical aspects of the design object, whereas the database system is used 
as a kernel to keep track of the overall hierarchy of design objects (or structures), 
design evolution (or versioning), and the relationships between design objects (in this 
paper, the terms object and design object will be used interchangeably). 

An OODB schema is used to define the design classes which are connected by the 
superclass/subclass (or is-a) relationships that is called class hierarchy. A configura- 
tion class is connected to its components’ classes by Part-Of relationships that is 
called a composite class hierarchy. We will show an example of a Bridge configura- 
tion where the composition of the Bridge components along with constraints on them 
are represented using a composite class hierarchy. The OODB is subdivided into 
design workspaces which are used to store and manipulate designs and their versions 
at different levels of authorization and maturity [7]. A workspace can be private or 
shared. A private workspace is owned by a specific designer who is utilizing it for the 
development of a design. Only the owner of a private workspace can read, modify, or 
delete its contents. A shared workspace, on the other hand, is common among different 
designers who can read or deposit their designs in it. However, there should be a veri- 
fication mechanism which allows only mature (or complete) designs to be deposited in 
a shared workspace [16]. A shared workspace represents an interaction area between
the designers of a specific project. It is accessible only to the designers in that project,
and access is granted by a project manager. The checkin and checkout operations are
used to deposit and retrieve design objects to/from workspaces. Some commercial
OODB products, such as ITASCA and VERSANT, support private and shared
workspaces [3].
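The workspace rules described above can be sketched in Python as follows. This is only an illustration under stated assumptions and not the interface of ITASCA or VERSANT: the class names, the `mature` flag (standing in for the verification mechanism) and the `members` set (standing in for access granted by the project manager) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Design:
    name: str
    mature: bool = False  # hypothetical stand-in for the verification mechanism


@dataclass
class PrivateWorkspace:
    owner: str
    contents: dict = field(default_factory=dict)

    def _require_owner(self, user):
        # Only the owner may read, modify, or delete the contents.
        if user != self.owner:
            raise PermissionError(f"{user} is not the owner of this workspace")

    def checkin(self, user, design):
        self._require_owner(user)
        self.contents[design.name] = design

    def checkout(self, user, name):
        self._require_owner(user)
        return self.contents[name]


@dataclass
class SharedWorkspace:
    members: set  # designers assumed to have been granted access
    contents: dict = field(default_factory=dict)

    def _require_member(self, user):
        if user not in self.members:
            raise PermissionError(f"{user} has no access to this workspace")

    def checkin(self, user, design):
        self._require_member(user)
        # Only mature (complete) designs may be deposited.
        if not design.mature:
            raise ValueError(f"{design.name} is not mature enough to be shared")
        self.contents[design.name] = design

    def checkout(self, user, name):
        self._require_member(user)
        return self.contents[name]
```

In this sketch an immature design is refused at checkin to the shared workspace, whereas the private workspace accepts any design from its owner.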



[Figure: architecture diagram; surviving label: CAD Systems]

Fig. 1. The architecture of CE design support environment

Fig. 1 shows the architecture of the support environment for the versioning of a CE 
design. In the Figure, the design artefact is translated from the CAD system into a 
STEP EXPRESS language representation. EXPRESS is a modelling language of the 
STEP (Standard for the Exchange of Product data) which is used to model engineering 
data. This setup is used to establish a uniform interface to heterogeneous CAD sys- 
tems which may be used in the CE design process. The design constraints are defined 
in EXPRESS language using a set of rules and other primitive constraints such as 
value uniqueness of an attribute. These constraints, along with the design artefact
representation, are translated into a C++ binding which maps them into the corresponding
design classes in the OODB, where the constraints can be triggered by
database events [2]. Hence, the design constraints are defined in the STEP EXPRESS
layer and not in the database. Analysis tools, such as a structural analysis tool, are also
interfaced with the OODB system using the C++ binding and automatically update the
calculations defining the properties of the design object if its geometrical properties
are changed (e.g. the load bearing of a beam). In [7] we showed how a STEP representation
of a CAD object, along with its analysis data, is stored and manipulated in an
OODB system. We also showed how versions of a simple design object and their
simple (or non-versioned) design configurations are stored and manipulated in an
OODB system which is subdivided into private and shared workspaces. However,
dynamic configuration versioning is not supported in that limited model.

In this paper, we discuss design evolution, dynamics, and consistency in a CE de- 
sign environment supporting configuration versioning. The remainder of this paper is 
organized as follows. In Section 2, the related work is reviewed. In Section 3, version 
management in OODBs is discussed. In Section 4, an object-oriented model for con- 
figuration versioning is presented. In Section 5, the dynamic and consistency issues in 
configuration versioning are addressed. In Section 6, concluding remarks and future 
work are given. 



2 Related Work 

Version management has received considerable attention in the literature with the aim
of maintaining a record of the evolution of an object. The driving motivation for this
interest is to match the requirements of engineering and software environments that 
have an iterative and tentative nature [1,4,9,12,14,16]. The version management of 
individual objects is covered adequately in the literature [9,18,16,19]. However, less 
attention has been directed to the versioning aspects of an object that has other objects 
as components (i.e., a configuration or composite object version management [9,16]). 
In this section, we review the work which addresses configuration version manage- 
ment within an engineering design context. 

Kim et al. [22] presented a model of versions of a composite object or 
configurations. A set of rules was introduced to capture the semantics of versions of 
a configuration. Although the model shows the derivation of versions, it does not ad- 
dress explicitly how the version set is organized (i.e. linear, tree, graph). A limited 
discussion of the dynamic aspects in configuration versions is presented, but the con- 
sistency of the versions of a configuration is not discussed. Ahmed and Navathe [11] 
proposed an approach for the management of configuration versions which are classi- 
fied into intrinsic interface and non-intrinsic internal assembly. Although disconnected 
version graphs are used to organize the version set of simple objects, the organization 
of the version set of a configuration is not clearly addressed. The dynamics in configu- 
ration versions is not clearly addressed and the consistency of versions of a configu- 
ration is also not discussed. Cellary and Jomier [18] proposed an approach to maintain 
the consistency of versions by allowing multiple versions of the whole database to 
co-exist. That is, each logical database version maintains a record of object versions 
that "go together" (i.e. are consistent with each other). To minimize redundancy, 
database versions may share similar versions of an object. Version stamps (a unique 
version identifier) are used to link each object and configuration version with the data- 
base versions it belongs to. Although a tree structure is used to organize the set of 
database versions, the model shows flat object versions and no clear relationship
between these versions is identified. The maintenance of these multiple database versions
will cause substantial overheads, particularly in a change-intensive design environment.
The dynamic aspects in configuration versions are not thoroughly discussed.

Some proposals such as [10,20] discuss version control for complex artefacts in 
engineering design but a concrete underlying data model is not clearly specified, and 
identifying the capabilities of a database system in the design environment is not 
addressed. Other proposals, such as [12], have a limited discussion on configuration 
versioning aspects or do not address configurations, concentrating only on simple 
object versioning [23]. Thus, most of the proposals reviewed concentrate on the basic 
versioning aspects of a simple object which is not composed of components and then 
extend the basic versioning to include versions of a configuration. However, this is 
not adequate in an engineering environment where the versioning system is expected 
to support the overall engineering design process. Hence, the emphasis on the whole 
artefact’s (or configuration) versions is essential. 

Our literature review supports our contention that support for design configuration 
version management is embryonic and pays little attention to the problem of 
consistent configuration management. Furthermore, the mechanisms for maintaining 
the dynamics in configurations are not thoroughly discussed and the concurrency 
support between designers using configuration versions is not addressed. In the 
following sections, we will show how these aspects can be addressed. 



3 Version Management in Object-Oriented Databases 

3.1 Simple Object Versioning 

A simple object is an object which does not contain other objects as components. 
Simple object versioning is the creation of a new version of a simple object by 
amending it in some way. That is, the object state before the change is retained as well 
as the new state, allowing multiple versions of object states to co-exist. In general, the 
mechanism for object updates in a system with version management support can be 
shown as follows: 

$V_i \xrightarrow{\alpha} V_{i+1} \xrightarrow{\beta} V_{i+2}$

where $V_i$ is an object version, $i$ is a version number, and $\alpha$ and $\beta$ are changes to the
object. Note that $\alpha$ and $\beta$ are user actions in the CAD interface on a version of object
data (or object state) which lead to the creation of a successor version of this object.
The process of version derivation can be linear or non-linear. The linear version man- 
agement scheme is a very simple scheme as it only allows change to the current ver- 
sion of the design object. This scheme keeps a history of changes to a version in a 
sequential order as shown in Fig. 2. In the Figure, $V_0$ is the original version, which is
created at time instance $t_0$. The first change occurs at time instance $t_1$ and version $V_1$
is created as a result of the change. The most recent change has occurred to the version
at time instance $t_n$, when the current version $V_n$ is created to store the change to
the design object. The set $\{t_0, t_1, t_2, \ldots, t_n\}$ of time instances is a temporally ordered
set, i.e., $t_i < t_{i+1}$ for all $0 \le i < n$.



t_0      t_1      t_2      ...      t_n
V_0 ---> V_1 ---> V_2 ---> ... ---> V_n

Fig. 2. Linear version management scheme



The non-linear version management scheme, on the other hand, is more general and 
flexible than the linear version management scheme. This scheme allows a designer to 
change any previous version to create a new version. A tree data structure is suitable 
for this scheme as from any node of the tree a new node can be generated. A tree data 
structure is also used in [16] to manage versions of an object. In our model, this 
structure is extended to a graph to allow version merging (i.e. more than one parent 
of a version). Hence, a versioned object consists of a hierarchy of versions called a 
version-derivation graph (VDG) connected by the Derived-From (or Parent/Child) 
relationships. If a new version of an object introduces cycles in the VDG, then the 
version is rejected as the new version already exists. New versions of an object need 
not be generated in a strict linear time sequence. It is possible that two or more ver- 
sions (alternative versions) of an object are derived from the same parent version. This 
can be used to experiment with different designs of an artefact. It is also possible to 
have an object version derived from two or more existing versions of an object (ver- 
sion merging). This can be used to produce a version of a design artefact that inte- 
grates different features from its parent versions, possibly developed by different 
designers. In our system, this is achieved by the designer combining the required fea- 
tures from the versions in the CAD interface workspace. Fig. 3 shows an example of 
alternative versions v2 and v3 which are derived from version v1, whereas version v4
is derived from merging versions v2 and v3. Every version of an object has its own 
unique identifier that distinguishes it from other versions of this object. Since object 
versioning allows multiple states of an object to co-exist, a mechanism is needed to 
represent an abstract object which is the versioned object. This is facilitated by the 
notion of a generic object. The generic object maintains information about the ver- 
sioned object which includes version counter, default version, and the VDG. Each 
version in the object versions set is connected to the generic object by the Version-Of 
relationship. 
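A minimal Python sketch of a generic object holding a version-derivation graph is given below. It is an illustration only, not the authors' implementation: the class and method names are hypothetical, and acyclicity is obtained here by the simplifying assumption that a new version may only name already existing versions as its parents.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GenericObject:
    """Abstract versioned object: version counter, default version and VDG."""
    name: str
    counter: int = 0
    default: Optional[int] = None
    # VDG stored as: version number -> set of parent version numbers
    parents: dict = field(default_factory=dict)

    def create(self):
        """Create the first version, i.e. the root of the VDG."""
        if self.parents:
            raise ValueError("the root version already exists")
        return self._add(frozenset())

    def derive(self, *parent_versions):
        """Derive a new version from one parent, or merge several parents."""
        if not parent_versions:
            raise ValueError("a derived version needs at least one parent")
        unknown = [p for p in parent_versions if p not in self.parents]
        if unknown:
            raise KeyError(f"unknown parent versions: {unknown}")
        return self._add(frozenset(parent_versions))

    def _add(self, parent_set):
        # Parents always exist before the child, so no cycle can be formed.
        self.counter += 1
        self.parents[self.counter] = parent_set
        return self.counter

    def ancestors(self, version):
        """Return every version reachable through Derived-From links."""
        seen, stack = set(), list(self.parents[version])
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(self.parents[v])
        return seen
```

Calling `derive` twice on the same parent yields alternative versions, and calling it with two parents yields a merged version, which mirrors the situation of v2, v3 and v4 described for Fig. 3.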

Object versioning is supported in many commercial OODB systems such as
GemStone, ITASCA, O2, Objectivity, and VERSANT [3].

Fig. 3. A graph of version derivation

3.2 Configuration Versioning 

A complex design artefact is normally decomposed into subcomponents to facilitate 
the design activities. Each subcomponent may, in turn, be subject to further 
decomposition recursively. To map this requirement into the object-oriented model, a 
composite object may be used. A composite object is an object which is composed of 
a set of other objects rather than only primitive domains. A PART-OF relationship is 
defined to link a composite object with its component objects. A configuration is a set 
of component designs which combine to form a version of the design artefact. Hence, 
a configuration may be considered as reflecting a particular state of a composite object
and, therefore, can be treated as a versioned object. A configuration is normally
represented as a Composite Class Hierarchy (CCH) whose nodes, other than the root, 
are the classes of components of the configuration and whose arcs represent the Part- 
Of relationships [5,9]. Fig. 4 and Fig. 5 show class definitions and a CCH of a sim- 
ple Bridge configuration. The Bridge is composed of three components: Founda- 
tions, Substructure, and Deck. Each component is composed from other components 
recursively. This is shown by dotted arrows. The constraints imposed on the Bridge 
and its components are shown in each class definition. For example, the maximum 
load on the Bridge is 42 tonnes. These constraints are defined in the STEP EXPRESS 
layer, as mentioned in Section 1, and mapped into the corresponding classes of the 
configuration in the OODB using the C++ binding. If a design artefact introduces a constraint
violation when grouping its components to form a configuration, the designer
is warned about the conflict and the configuration components are identified as being
inconsistent with that configuration version. Consequently, the consistency descriptor,
which is attached to the configuration and each of its components, is updated accordingly
in order to identify whether the component is consistent with a configuration.
The consistency descriptor will be explained in more detail in Section 5. When a designer
needs to manipulate Bridge data, he or she creates an instance of the Bridge class.
This instance represents the Bridge configuration together with its components.

Fig. 4. Class definition of a bridge configuration




Fig. 5. A CCH of a bridge configuration 
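The composite class hierarchy of the Bridge can be approximated by the following Python sketch. Only the three top-level components, the Beams component of the Deck and the 42 tonne load limit are taken from the text; the class layout, the `load_tonnes` attribute and the `violations` method are assumptions made for illustration, and the actual class definitions of Fig. 4 are not reproduced.

```python
from dataclasses import dataclass, field

MAX_BRIDGE_LOAD_TONNES = 42  # constraint quoted in the text


@dataclass
class Component:
    name: str
    parts: list = field(default_factory=list)  # Part-Of children

    def walk(self, depth=0):
        """Yield (depth, name) for this node and all nested components."""
        yield depth, self.name
        for part in self.parts:
            yield from part.walk(depth + 1)


@dataclass
class Bridge(Component):
    load_tonnes: float = 0.0  # hypothetical primitive attribute

    def violations(self):
        """Return warnings rather than rejecting the design outright."""
        problems = []
        if self.load_tonnes > MAX_BRIDGE_LOAD_TONNES:
            problems.append(
                f"load {self.load_tonnes} t exceeds {MAX_BRIDGE_LOAD_TONNES} t"
            )
        return problems
```

Returning a list of warnings, instead of raising an error, follows the description above in which the designer is warned about a conflict and the inconsistent design may still be kept.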



The notion of configuration in some commercial OODB systems, such as O2, refers
normally to a collection (or set) of versions of the same simple object. This is 
different from the present discussion of configuration. However, we have shown that
introduction of new classes can be used in these systems to model design
configurations [8]. 

Configuration versioning is the creation of a new version of the configuration. 
That is, the configuration state before the change is retained as well as the new state, 
allowing multiple versions of a configuration to co-exist. The evolution of a con- 
figuration is represented as a hierarchy called a configuration-derivation graph 
(CDG). A generic configuration represents an abstract configuration which is associated
with the versioned configuration. The generic configuration contains the configuration
counter, default configuration version, and a CDG. Configuration versioning
allows the instance of each class in the standard CCH to have multiple versions,
which will be explored in more detail in Section 4. The binding of a configuration
to its versioned components can be static, i.e. to a specific version of the component,
or dynamic, i.e. to a default setting that can be resolved at run time.

The concepts discussed for simple object versioning may be extended to the 
versioning of a configuration. However, keeping track of multiple versions of a con- 
figuration is more complex than the versioning of a simple object. Therefore, without 
careful management of the evolution of a configuration and its components, conflicts 
may arise between the components of a design artefact. Hence, it is essential to iden- 
tify which component version is compatible with which configuration version [13] 
(i.e. a consistent configuration). We will discuss the consistency of
dynamically bound configurations in more detail in Section 5.

Fig. 6 shows a generalized example of a configuration called CF which has
three components A, B and C, each of which may have different versions denoted as
$a_{1..k}$, $b_{1..m}$, and $c_{1..n}$, where $k$, $m$, and $n$ are the latest version numbers and $t_i$
represents the timestamp at which the version of a configuration or component is derived.
At a time instance $t_1$, the configuration is represented as $CF_1(a_p, b_q, c_r)$, where the
subscript associated with each component version represents which version of the component is
bound to this configuration. A static binding is assumed here. Later, at a subsequent time
instance, the configuration structure becomes $CF_2(a_{p'}, b_{q'}, c_{r'})$. If no configuration
versioning is supported, $CF_2$ will replace $CF_1$ and $CF_1$ is lost. However, it represents a
state of the design which may in the future be consulted or reinstantiated as a new
version. If configuration versions are supported, then $CF_1$ and $CF_2$ co-exist and they
can be returned to at any time.

4 An Object-Oriented Model for Configuration Versioning 

4.1 The Model 

In this section, we present an object-oriented model which supports the semantic 
extension of versions to cope with configuration versions as shown in Fig. 7. This 
model is independent of the content of the design object and can represent a totally-versioned
configuration (i.e. both the configuration and its components are versioned).
This enables the version model to be sufficiently generic [16] so that it can be applied
to a variety of engineering domains such as the design of automobiles, aircraft,
buildings, electronic products, VLSI, etc. A basic overview of this model follows.
Note that some aspects of the model may appear to be rather complex. This is due to
the inherent complexity of the CE design environment.



[Figure: configuration versions CF1 and CF2 with their component versions. Legend: Part-Of Relationship; o = Object/Configuration Version]

Fig. 6. Configuration versions



4.1.1 Model Elements 

The model elements represent both versioned and non-versioned design objects. This 
means that not only the individual components can evolve but also the whole 
configuration can evolve and bind its components in a dynamic fashion. To capture the 
evolution of a CE design, the version sets of both the configuration and its components 
are represented in a compact form using derivation graphs. The model also supports 
non-component objects which may be referenced by a configuration and/or its components.

A versioned configuration contains a set of components and a set of references to
non-component objects. A component can be simple or composite. A composite
component is called a subconfiguration, which is composed from other components.
Hence, a configuration may contain subconfigurations recursively. Moreover, the 
simple and composite components can be versioned or non-versioned. The model at 
all levels of the configuration composition captures this setup. 

Now each element of the model will be looked at in a top-down fashion. The set 
of versions of a configuration is represented as a CDG. A CDG is a DAG which 
shows the evolution of a particular configuration in a compact form. The nodes in a 
CDG represent configuration versions and the arcs represent the Derived-From relationship
between a pair of consecutive configuration versions.

Fig. 7. An object-oriented model of configuration versions

A Generic Configuration (GC) is associated with each versioned configuration and 
maintains a record of the versions of a design configuration. A GC is shown as a dotted
box in Fig. 7. The set of versions of a configuration in a CDG is shown inside a triangle.
A non-versioned configuration is called a Simple Configuration (SC). It is a
design object which contains other design objects as components. This is equivalent 
to a composite object and only the latest version is kept (i.e. no versioning). A Simple 
Object (SO) is a design object that is self contained and does not include other design 
objects as components and it can be versioned or non-versioned. The non-versioned 
simple object and configuration are shown as circles in Fig. 7. If the simple object is 
versioned, its set of versions is represented as a VDG. A VDG is a DAG which
shows the evolution of a particular object in a compact form. The nodes in a VDG
represent versions of the simple object and the arcs represent the Derived-From relationship
between a pair of consecutive simple object versions. The set of versions of
a simple object is shown inside a triangle. A Generic Object (GO) is associated with
each versioned object and is shown as a dotted box in Fig. 7. A Referenced Object
(RO) is a non-component object that is referenced by a configuration and/or component.
It is shown as a dotted circle in Fig. 7.

In our model, a versioned component can be further composed from other 
components. In this case, it is considered as a subconfiguration and the mechanism 
used in configuration versioning is applied to it. If a versioned component, on the 
other hand, does not contain components, then it is considered as a simple object and 
the mechanism used in simple object versioning is applied to it. 

The model identifies four basic relationships. The cardinality of the relationship is 
shown on each link: 

1. Configuration-to-Component relationship: This relationship links a configuration
to its components via a Part-Of relationship. This relationship is
shown in Fig. 7 as double-headed arrows.

2. Object-to-Versions relationship: This relationship links a design object (i.e.
configuration/component) to each version in its version set via a Version-Of
relationship. This relationship is shown in Fig. 7 as a single-headed arrow followed
by a triangle that contains the version set of the object.

3. Version-to-Version relationship: This relationship links an object or a
configuration version to its immediate successor version in the CDG/VDG via
a Derived-From relationship. This relationship is shown in Fig. 7 as
single-headed arrows inside the triangle.

4. Configuration/Component-to-Non-Component relationship: This relationship
links a configuration or its components to a non-component object.
This relationship is shown as dotted circle-headed lines in Fig. 7.

Note that a version of a component may be part of more than one configuration 
version. Hence, a referencing mechanism is required to indicate in each component 
version the corresponding configuration version it belongs to. This is facilitated by a 
reverse reference in each version of a component showing to which configuration 
version it belongs. Furthermore, a component may be part of more than one configu- 
ration, other than the versioned configuration. Again, the reverse reference can be 
used to facilitate component sharing. 
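The reverse reference can be pictured as an index from each component version to the configuration versions that use it. The Python sketch below is a hedged illustration: the `ReverseIndex` class and the string identifiers such as "CF.1" and "A.3" are hypothetical and do not reflect how the model stores references internally.

```python
from collections import defaultdict


class ReverseIndex:
    """Maps a component version to the configuration versions that bind it."""

    def __init__(self):
        self._used_by = defaultdict(set)

    def bind(self, config_version, component_version):
        # Record the reverse reference when a configuration binds a component.
        self._used_by[component_version].add(config_version)

    def unbind(self, config_version, component_version):
        self._used_by[component_version].discard(config_version)

    def used_by(self, component_version):
        # Return a copy so callers cannot alter the index by accident.
        return set(self._used_by.get(component_version, ()))

    def is_shared(self, component_version):
        return len(self.used_by(component_version)) > 1
```

With such an index, asking which configuration versions a component version belongs to is a single lookup, and a shared component is simply one with more than one entry.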



4.1.2 Model Basic Operations 



A design object can be versioned or non-versioned. The following is an overview of 
the basic operations for manipulating design objects identified by their names. The 
object id (OID) is automatically generated by the system which maintains the mapping 
between the OID and the object name. The designer issues these operations using
the CAD system's extended tool set, which enables him or her to interact with the OODB
server, as we showed in detail in [7]:

- Simple object

• CREATE <object name> OF <class name>
    {list of <attribute name, value>}

• MODIFY <object name> WITH
    {list of <attribute name, value>}

• DELETE <object name>

- Configuration

• CREATE <configuration name> OF <composite class name>
    {list of <attribute name, value>}
    COMPONENT {list of <components>}

• MODIFY <configuration name> WITH
    {list of <attribute name, value>}
    COMPONENT {list of <components>}

• DELETE <configuration name>



The operations specific to versioned objects/configurations are: 

- Simple object

• DERIVE <object name>.<version number>
    FROM {list of parents}

• DELETE <object name>.<version number>

• SET DEFAULT <object name> TO <version number>

- Configuration

• DERIVE <configuration name>.<version number>
    FROM {list of parents (<component name>.<version number>)}

• DELETE <configuration name>.<version number>

• SET DEFAULT <configuration name> TO <configuration version number>

The CREATE operation of a simple object is used prior to that of the configuration. 
That is, the list of attribute values in the CREATE operation of a configuration is 
used to assign values to the primitive attributes of the configuration such as the load of 
a Bridge. To assign values to the attributes of the components, the CREATE operation 
must be issued to the component. Furthermore, it is important to differentiate between
the creation of an object itself and the creation of a version of this object. The creation 
of an object is the process of establishing new knowledge in the database and adding a 
new object as an instance of a specific class. The creation of a version of an object is 
the process of augmenting an existing object (or an object version) and hence creating
a new version. Therefore, a versioned object must be created first using the CREATE
operation. This will establish the first instance of the object as the root of the VDG. 
Further modification to the object can be done using the DERIVE operation. When 
deriving a new version of a configuration, the component name along with the specific 
version number should be specified. If no version number is specified, then the refer- 
ence will be to the component’s generic object and the component version will be 
resolved at run time using the default setting in the generic object. Version merging 
and alternative (or branching) versions are facilitated in the DERIVE operation. 
When merging versions, two or more parent versions are identified in the list of parents.
For example, the operation DERIVE v7 FROM v1, v2, v5 will create version v7
by merging parent versions v1, v2, and v5. The semantics of merging versions is the
responsibility of the designer who selects parts of the CAD drawings to create the new 
version. Alternative versions, on the other hand, are created by using the DERIVE 
operation more than once on the same parent version. For example, the operations 
DERIVE v5 FROM v4 and DERIVE v6 FROM v4 will create two alternative ver- 
sions v5 and v6 from the same parent version v4. The full mechanisms governing 
version merging and alternative versions are beyond the scope of this paper. 
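To make the two uses of DERIVE concrete, the following Python sketch parses the simplified command form used in the examples above and separates merges from alternative versions. The parser, its regular expression and the function names are assumptions introduced for illustration; the paper does not specify a concrete grammar or implementation.

```python
import re

# Simplified form taken from the examples: DERIVE <target> FROM <p1>, <p2>, ...
_DERIVE = re.compile(r"^\s*DERIVE\s+(\S+)\s+FROM\s+(.+?)\s*$", re.IGNORECASE)


def parse_derive(command):
    """Return (target version, list of parent versions)."""
    match = _DERIVE.match(command)
    if match is None:
        raise ValueError(f"not a DERIVE command: {command!r}")
    parents = [p.strip() for p in match.group(2).split(",") if p.strip()]
    if not parents:
        raise ValueError("DERIVE needs at least one parent version")
    return match.group(1), parents


def is_merge(command):
    """A merge names two or more parent versions."""
    return len(parse_derive(command)[1]) > 1


def alternatives(commands):
    """Group versions derived separately from the same single parent."""
    by_parent = {}
    for command in commands:
        target, parents = parse_derive(command)
        if len(parents) == 1:
            by_parent.setdefault(parents[0], []).append(target)
    return {p: t for p, t in by_parent.items() if len(t) > 1}
```

The sketch only classifies commands; deciding what a merged version actually contains remains, as stated above, the responsibility of the designer.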

The deletion of a configuration may involve the deletion of all its components 
whose existence depends on it, whether they are shared or not. If a component is
independent (i.e. its existence does not depend on the configuration existence), then it 
is not deleted because it may be referenced by other configurations and/or objects. 
Note that the deletion of a configuration becomes more complicated when versioning 
of a configuration of a design is supported. Moreover, a configuration's components
may be owned by different designers and a deletion of a particular component of a 
configuration may lead to a dangling reference. Hence, the deletion of a component 
should be avoided if a configuration still references it. To assist this feature, each 
component of a configuration may contain a reverse reference showing to which con- 
figuration it belongs as mentioned earlier. It is preferable in a design database to keep 
deletions of designs to a minimum. Configuration versioning may be considered as a 
desirable alternative to deletion, as it introduces a new version of the configuration 
which does not include the deleted component while preserving the references to the 
"deleted component" in the previous version. This is called virtual deletion. 

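Virtual deletion, as described above, can be sketched in a few lines of Python. The representation of a configuration's history as a dictionary from version number to a tuple of component names is an assumption made for this example, as are the function names.

```python
def virtual_delete(history, parent, component):
    """Derive a new configuration version that omits one component.

    history maps a configuration version number to a tuple of component
    names (a hypothetical representation). The parent version is left
    untouched, so its reference to the component is preserved.
    """
    if component not in history[parent]:
        raise ValueError(f"{component} is not part of version {parent}")
    new_version = max(history) + 1
    history[new_version] = tuple(c for c in history[parent] if c != component)
    return new_version


def still_referenced(history, component):
    """True if any configuration version still references the component."""
    return any(component in components for components in history.values())
```

Because an earlier version still references the component, `still_referenced` stays true after a virtual deletion, which is the condition under which a physical deletion would leave a dangling reference and should be avoided.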
Fig. 8 shows an example of a Bridge configuration used in an experimental study 
conducted in [8]. The configuration consists of three components: Foundations,
Substructure and Deck. Each component may be composed from other components
recursively, which is shown as dotted arrows in the figure. The dynamic aspects in this
configuration are shown as boxes which means that the Bridge configuration version 
v9 is dynamically bound to the Foundations component. This binding is resolved at 
run time to the default setting defined in the generic object of the Foundations compo- 
nent. Likewise, the Deck component includes a dynamic binding to the Beams com- 
ponent. The circles, on the other hand, show a static binding to a specific version of 
the component.

[Figure legend: circle = Object Version; box = Generic Object; double arrowhead = Part-Of Relationship]

Fig. 8. An example of a Bridge dynamic configuration version



4.2 Configuration Versions in a CE Design 

We differentiate between two modes of concurrency in a CE environment which are: 

Local Concurrency: The concurrency of tasks within a particular product 
development phase such as the design phase and the manufacturing phase. 
Total Concurrency: The concurrency which involves all the product 
development phases from the conceptual design to marketing. 

In this paper, we concentrate on the local concurrency within the product design 
phase. The multiversion configuration model facilitates CE design by allowing 
versions of a configuration and/or its components to be designed simultaneously.
Hence, each designer can have a version of the configuration that reflects a design
state. This is done by a DERIVE operation of the required configuration or component.
To this end and according to CE principles, the components of a configuration
may be designed concurrently. Sequential engineering, on the other hand, requires that
design activities are conducted in a consecutive manner [6]. Therefore, in the exam- 
ple of a Bridge design in Fig. 8, all the components of the Bridge can be designed 
concurrently by supplying each design group with their own version of a particular 
component or even the complete configuration (i.e. the complete artefact), provided 
the authorization to access the corresponding design objects is granted. Obviously, 
there should be a mechanism for coordination and communication between the concurrent
design groups, which was discussed in [3]. This allows the Foundation and
Deck designers to work at the same time using their own version of the component/configuration.
Later on in the CE design process, the design versions can be
grouped/merged together to form a new complete configuration represented as a new 
version in the CDG. 

5 Dynamics and Consistency in Multiversion Configurations 

5.1 Default Version Resolution 

Since a versioned design object is composed of a generic object and a set of versions 
each of which represents a state of the design object at a time instance, it is necessary 
to resolve dynamic binding to a generic object rather than static binding to a particular
version in the version set. Furthermore, a versionable design object may be
referenced by many design objects; therefore, one needs to include two types of default
selection in the generic object. The first type is a context default version that is
used for a particular referencing resolution. This default version is set by the designer 
who references the versionable design object. Thus, multiple context default ver- 
sions may co-exist for a versionable component. The second type of default version is 
a generic default version that is used in the absence of a context default version. This 
generic default version is set by the designer who creates the component object and 
only one generic default version can be defined for a component. If no generic de- 
fault version is specified by the designer, the most recent version will be considered. 
The advantage of the context default version assigned by the designer is that it allows the
default selection to be obtained from a set of consistent versions that matches with the 
referencing configuration version. The SET DEFAULT operation which is shown in 
Section 4.1.2 is used to set the generic and context default versions. However, the 
designer who sets the context default version must have an authorization to access the 
corresponding versioned configuration/component. 
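
The resolution order described above can be illustrated with a minimal sketch. The class and function names (`GenericObject`, `resolve_default`) and the use of plain strings as version and configuration identifiers are assumptions made for illustration; they are not part of the model's actual interface, and authorization checks are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GenericObject:
    """Hypothetical generic object: a version set plus its default selections."""
    versions: List[str]                    # version IDs, oldest first
    generic_default: Optional[str] = None  # at most one per component
    # One context default per referencing configuration (assumed keyed by its ID).
    context_defaults: Dict[str, str] = field(default_factory=dict)


def resolve_default(obj: GenericObject, context: Optional[str] = None) -> str:
    """Resolve a dynamic reference: context default, else generic default, else latest."""
    if context is not None and context in obj.context_defaults:
        return obj.context_defaults[context]
    if obj.generic_default is not None:
        return obj.generic_default
    return obj.versions[-1]  # fall back to the most recent version


deck = GenericObject(versions=["v1", "v2", "v3"])
deck.context_defaults["config_A"] = "v1"
print(resolve_default(deck, "config_A"))  # context default applies
print(resolve_default(deck, "config_B"))  # no defaults set: most recent version
```

A referencing configuration without its own context default falls through to the generic default, which mirrors the precedence stated in the text.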



5.2 Consistency in Dynamic Configurations 

The consistency between a configuration and its components is one of the key issues
in a CE design. It refers to the requirement that the design interfaces between
components do not result in conflicts and that each component is within the design
constraints for its properties. This issue becomes more complex in a dynamic environment
where both the configurations and their components are versioned. A consistent
configuration is a configuration which satisfies the consistency constraints imposed
on it [2]. These constraints are modelled using the STEP EXPRESS language [2], as
addressed in [3]. This implies that a consistent configuration version is a
configuration which has a version of each of its components that can be grouped together
without causing design conflicts. Because the number of component versions may grow
rapidly, specifying which version of a component is consistent with which
configuration version is problematic. Note that enforcing design constraints is not
discussed here. The concern in this paper is whether the components of a dynamically
bound configuration version are consistent. However, due to the tentative nature of
the design process, designers may need to keep inconsistent versions of a component
which require further refinement.

To maintain consistency, we introduce the notion of a consistency descriptor. A
consistency descriptor is an object that is associated with each design object, versioned
or non-versioned. It contains details about the objects that are related to it in terms of
their consistency in the community of design objects. The descriptor contains the
following information for each referenced object:

1. Object/version ID of the referenced object. 

2. Consistency state: consistent/inconsistent. 

3. Time stamp: the time of determination of the consistency state. 

4. Relationship: ingoing/outgoing (this indicates whether the design object is 
part of other design objects or if it references other design objects as its 
components. This enables bi-directional consistency relationships that can 
be obtained from either of the related design objects). 
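
As a rough illustration of the four items above, a consistency descriptor could be represented as follows. All names (`State`, `Direction`, `DescriptorEntry`, `ConsistencyDescriptor`) are hypothetical, and the mapping of "ingoing" to part-of and "outgoing" to has-component is an assumption about the intended reading.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Dict, List, Optional


class State(Enum):
    CONSISTENT = "consistent"
    INCONSISTENT = "inconsistent"


class Direction(Enum):
    INGOING = "ingoing"    # assumed: this object is part of the referenced object
    OUTGOING = "outgoing"  # assumed: this object uses the referenced object as a component


@dataclass
class DescriptorEntry:
    object_id: str           # 1. object/version ID of the referenced object
    state: State             # 2. consistency state
    timestamp: datetime      # 3. when the state was determined
    relationship: Direction  # 4. direction of the relationship


@dataclass
class ConsistencyDescriptor:
    """One descriptor per design object, holding one entry per referenced object."""
    entries: Dict[str, DescriptorEntry] = field(default_factory=dict)

    def record(self, object_id: str, state: State, relationship: Direction,
               when: Optional[datetime] = None) -> None:
        self.entries[object_id] = DescriptorEntry(
            object_id, state, when or datetime.now(), relationship)

    def consistent_with(self) -> List[str]:
        """IDs of the related objects currently marked consistent."""
        return sorted(e.object_id for e in self.entries.values()
                      if e.state is State.CONSISTENT)
```

Because each related object carries the mirror-image entry, the relationship can be looked up from either side, which is what makes the bi-directional lookup described in item 4 possible.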

Most of the proposals in the literature decouple the consistency representation from
the object itself, using a static mechanism to link consistent objects (e.g. layers,
surfaces). The maintenance of these links imposes overheads [18]. The
technique proposed in this paper associates the consistency information with each design
object. The advantage of this approach is that the objects in the database that are
consistent with each other can be identified dynamically.

Consistency constraints are automatically checked in multiversion configuration at 
two levels. The first is at the level of a component version. The second is at the level 
of a configuration version. Hence, if the resulting configuration version is consistent 
(i.e. conforms to its design constraint), then the consistency state entry in the consis- 
tency descriptor is automatically updated for the configuration version and each of its 
constituent components to reflect the consistency state (i.e. consistent). If, on the other 
hand, the configuration is not consistent, then only the consistency descriptor in the 
configuration version is updated to show that the configuration is inconsistent. The
consistency descriptors of the component versions need not be set to the inconsistent
state, because the consistency state entry of a newly derived component version is
always initialised to inconsistent. Only when constraint checking at the configuration
level reveals a consistent configuration is the consistency state of the component
version set to consistent. Consequently, the system can automatically determine whether
the components of a configuration version are consistent with each other.
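
The update rule can be sketched as below. This is a simplification under stated assumptions: a single state per version is kept in a plain dictionary instead of per-relationship descriptor entries, and `satisfies_constraints` is a hypothetical callback standing in for the constraint checker, which is outside the scope of this paper.

```python
from typing import Callable, Dict, List


def derive_version(states: Dict[str, str], version_id: str) -> None:
    """A newly derived component version always starts out inconsistent."""
    states[version_id] = "inconsistent"


def check_configuration(states: Dict[str, str], config_id: str,
                        component_ids: List[str],
                        satisfies_constraints: Callable[[List[str]], bool]) -> bool:
    """Update consistency states after a configuration-level constraint check."""
    if satisfies_constraints(component_ids):
        # Consistent: mark the configuration and every constituent component.
        states[config_id] = "consistent"
        for component_id in component_ids:
            states[component_id] = "consistent"
        return True
    # Inconsistent: only the configuration entry changes; components keep their state.
    states[config_id] = "inconsistent"
    return False
```

Leaving the component entries untouched on failure is safe because a component can only ever have been promoted to consistent by an earlier successful check.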

The binding of a configuration to its components can be static or dynamic. The 
problem associated with the dynamic binding is the rule by which the reference reso- 
lution of a default setting is made. Some proposals consider the default version to be 
the latest version created [14]. Others distinguish the current version from the latest
and decouple the default from the latest version [16]. Other proposals leave this
choice to the designer who decides on a default setting [3]. None of these previous
approaches discusses the consistency issues in dynamic binding. We believe that this
is a key issue in resolving dynamic binding. Otherwise, major conflicts between a
configuration and its component versions may occur. Note that ultimately static and
dynamic schemes will bind a configuration to a particular component version. The
question is whether this version is consistent with the configuration. This question is 
difficult to answer in the case of dynamic binding since the version is not known in 
advance as is the case in static binding. A solution to this problem is introduced which
is based on dividing the space of versions of a component into consistent and
inconsistent version subsets. The next step is to decide on a default version for the
consistent subset (i.e. the context default). This will ensure that during dynamic binding
resolution, the default version is only obtained from the consistent subset of versions.
If the designer decides to ignore the consistency factor, the default version of the 
whole version set (i.e. the generic default) will be considered based on a resolution 
policy (e.g. latest, current, user defined version). In effect, this default version may 
be in a consistent or inconsistent state. Note that the default version resolution in the 
consistent subset of versions may have its own resolution policy which may be differ- 
ent from the policy used in the whole version set. Fig. 9 shows a versioned configura- 
tion which has two components, one is versioned and the other is not. The figure 
shows the VDG of the component versions. Relative to that configuration, the consis- 
tency state of each version is shown as C if the version is consistent with the configu- 
ration and I if the version is inconsistent with the configuration. 

A configuration is said to be consistent if all of its components are in a
consistent state, and inconsistent if at least one of its components is in an
inconsistent state. Note that a configuration may have a mix of components in
consistent and inconsistent states, in which case the whole configuration is
inconsistent. This means that an inconsistent configuration may contain consistent
subconfigurations. Later on, when all the components reach a stable state and satisfy
the consistency constraints of the design, the whole configuration becomes consistent,
as shown in Fig. 10. This approach supports a flexible design environment
in which designers are not constrained in their creation of new component
versions.

Since the design environment evolves in a continuous manner, the consistency 
states may be subject to change. Therefore, a notification mechanism is needed to 
inform the designers who are referencing a design object that its state now may not 
match the consistency descriptor in their version of the design object. Furthermore, 
when a design object/version is deleted, all the consistency descriptors in the design
objects referencing it need to be updated. Hence, the consistency descriptor needs
continuous maintenance to reflect the actual consistency between design objects
accurately. The consistency descriptor facilitates this process by keeping a record of
all the design objects/versions that reference an object/version. A lookup/notify 
mechanism can then be applied for each entry in the consistency descriptor. 
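
A lookup/notify mechanism of this kind might look like the following sketch. The `NotificationRegistry` class, its in-memory inbox, and the message strings are illustrative assumptions; the paper does not specify how notifications are delivered.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class NotificationRegistry:
    """Hypothetical registry mapping each object/version to the objects that reference it."""

    def __init__(self) -> None:
        self.referenced_by: Dict[str, Set[str]] = defaultdict(set)
        self.inbox: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_reference(self, referrer: str, target: str) -> None:
        self.referenced_by[target].add(referrer)

    def notify_change(self, target: str, message: str) -> List[str]:
        """Look up every referrer of `target` and queue a notification for each."""
        notified = sorted(self.referenced_by[target])
        for referrer in notified:
            self.inbox[referrer].append((target, message))
        return notified

    def delete(self, target: str) -> List[str]:
        """Notify referrers that `target` was deleted, then drop its record."""
        notified = self.notify_change(target, "deleted")
        self.referenced_by.pop(target, None)
        return notified
```

Each referrer that receives a notification would then refresh the corresponding entry in its own consistency descriptor.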

6 Conclusions and Future Work 

In this paper, we have presented an object-oriented model for design configuration
versioning in a CE design environment. The dynamic aspects of multiversion
configurations as well as the consistency between a versioned configuration and its
versioned components are discussed. We are extending earlier work [8], which implements
the versioning of simple design objects, to include configuration versioning. The
extension is currently being implemented in prototype form using the Objectivity OODB
system [26], which is distributed on SUN SPARC Ultra stations running Solaris 7 and
Windows NT4 stations.




Fig. 9. Configuration consistency (legend: C = consistent, I = inconsistent)

Fig. 10. Configuration consistency states (legend: o = object/configuration version,
arrows = part-of relations, C = consistent version, I = inconsistent version)
One of the important aspects in managing configuration versions is change
propagation and its effect in automatically generating new configuration versions.
Another important aspect is the authorization which governs access to a
configuration version and its components. Future work includes extending the
configuration versions model to incorporate change propagation and authorization.

Abstract. Data abstraction and query processing techniques are usually
studied in the domain of administrative applications. We present a
case-study in the non-standard domain of (multimedia) information
retrieval, mainly intended as a feasibility study in favor of the ‘database
approach’ to data management.

Top-N queries form a natural query class when dealing with content
retrieval. In the IR field, a lot of research has been done on processing top-N
queries efficiently. Unfortunately, these results cannot directly be ported
to the database environment, because their tuple-oriented nature would
seriously limit the freedom of the query optimizer to select appropriate
query plans.

By horizontally fragmenting our database containing document statis- 
tics, we are able to combine some of the best of the IR and database 
optimization principles, providing good retrieval quality as well as 
database ‘goodies’ like flexibility, scalability, efficiency, and generality. 
The key issues we address in this paper are the effects of our
fragmentation approach on the speed and quality of the answers and the
opportunities for scalability, supported by experimental results.

Keywords: top-N, indexing, query optimization, content based retrieval,
multimedia, databases



1 Introduction 

Data abstraction is the essence of the ‘database approach’ to data management. 
Specifying the manipulation and definition of data at a high level of abstraction 
provides not only data independence, but also enables a database management 
system to facilitate a wide variety of additional ‘goodies’, including efficiency, 
scalability, and flexibility. 

B. Read (Ed.): BNCOD 2001, LNCS 2097, 2001.
The importance of services such as transaction management and concurrency
control in administrative applications - and their excellent support in
commercial relational database management systems - has resulted in the database
approach being the norm in the domain of business applications. Unfortunately,
the benefits of the database approach to data management are not so well
established in most other application domains.

This paper demonstrates how two key elements of the database approach (i.e. 
data abstraction and query optimization) can play a similarly important role in 
the domain of information retrieval (IR). In general, IR systems are not very 
flexible, in the sense that e.g. changing the retrieval model (ranking algorithm) 
is far from trivial. Also, the physical access and storage structures, although 
often quite sophisticated, are ‘hard-coded’ into the system. As a result, it is 
not so easy to turn an existing stand-alone IR system into a parallel and/or 
distributed system. Neither is generality a common property among IR systems; 
most systems support document retrieval by content only, and not by other 
attributes such as author, category, or publication date. In any case, it will be 
very difficult to add extra attributes in a later stage. 

In contrast to the conventional approaches in the IR field, we do not tie our
IR retrieval model to a physical data structure like inverted files, but specify
the model declaratively at a high level (allowing flexibility in the choice of
retrieval model) as described in [?]. The particular advantage of this approach
is that it allows us to extend our research DBMS with IR techniques, without
breaking the set-oriented nature of query processing. Such a combination of IR
techniques with traditional database technology is an important ingredient for
the development of search engines for large collections of XML documents that
allow queries on the combination of structural properties and field values
(traditional data retrieval) as well as their content (requiring the information
retrieval techniques). Also, as motivated in [?], and demonstrated in [?] at VLDB’99,
the same techniques provide a strong foundation for the implementation of
multimedia retrieval systems.

The main objective of our paper is to (1) present a case-study demonstrating
how data abstraction is equally useful in non-traditional application domains as
in the administrative domain, and (2) motivate why a novel DBMS architecture
is required to facilitate such broadening of core database technology for new
domains. We start with a state-of-the-art IR retrieval model that performs very
well on retrieval evaluation experiments (see [?]). We then present the
integration of these algorithms in our research DBMS, which resulted in a system
sufficiently powerful to participate in TREC¹ [?]. This paper discusses the
development of new query processing techniques at the logical level of the DBMS,
improving its efficiency and scalability on IR query loads. These techniques are
applied transparently in the mapping from abstract specification to implementation,
thus retaining the flexibility to combine queries on content with the usual
data retrieval, and/or experiment with novel IR models at a declarative level.

¹ TREC, the Text REtrieval Conference, is a well-known IR conference,
organized annually by NIST in the US (http://trec.nist.gov/). A key part of
the conference submissions involves the benchmarking results of retrieval systems.
To support this, standardized sets of documents and queries are provided by the
conference organization.

The remainder of the paper is organized as follows. First, we outline the
intuition underlying our optimization strategy in Section 2 and introduce the
MirrorDBMS prototype in Section 3. Next, we describe the proposed query
processing techniques in Section 4. Section 5 outlines the experimental setup to
evaluate these techniques, and Section 6 presents and analyzes the results of our
experimental evaluation. Section 7 presents the conclusions and future work.

2 Problem Statement 

In IR systems, users express their information needs using a small number of 
keywords and relevance judgments on previously retrieved documents (called 
relevance feedback). Similarly, querying multimedia objects requires the user 
to specify characteristics of the content, e.g. by describing a color histogram of 
desired images. However, for the sake of clarity, we limit the scope of our discussion to
the ranked retrieval of text documents. Using the query and relevance feedback as 
an approximation of the real information need, the system then selects objects 
with characteristics ‘similar’ to those specified in the query. Notice that this 
comparison between objects is usually based on metadata extracted from the 
original documents, such as words occurring in the texts. 

In the straightforward implementation of this process, each interaction step 
between user and system involves ranking all objects based on their similarity to 
the query, although only the N ‘most’ similar objects are presented to the user 
(the top-N objects). Obviously, top-N query optimization (attempting to
compute only the similarity for the top-N documents to be presented) is a natural
step to improve the efficiency of (multimedia) information retrieval. 

As mentioned briefly, content querying is interactive and iterative: after re- 
viewing, a user gives an indication of the quality of the answer (relevance feed- 
back) which is used to modify the original query. Processing the modified query 
generates new answers, and so on. Thus, it seems particularly interesting to cut 
off query processing at a reasonable stage, and show the results computed till 
then to the user for relevance feedback. Although the quality of the answer may 
be impeded, this may allow for great reductions in computations, possibly with- 
out diminishing the effectiveness of the relevance feedback process. Our intuition 
is that incomplete answers can still provide a good basis for relevance feedback: a 
quick approximate response may still provide sufficient information to refine the 
estimate of the user’s information need, and consequently improve the effective- 
ness of retrieval. For each iteration, the users may decide for themselves whether 
they prefer quicker responses of (generally) lower quality, or slower responses of 
(hopefully) higher quality. 

In handling top-N queries more efficiently, implementations of IR systems
have drawn on a combination of domain knowledge - exploiting the Zipfian [?]
distribution of terms found in documents (see e.g. [?]) - with smart
element-at-a-time cut-off operations derived from the ranking function, as in the
implementation of the well-known INQUERY retrieval system [?]. Exploiting the
element-at-a-time manner of processing, highly accurate cut-off conditions can
be updated after evaluating each element, which efficiently reduces the number of
obsolete intermediate results being computed. This algorithm and comparable
ones (e.g. [?]) exploit a carefully designed ordering of the data, mathematically
well-founded by the work of Fagin [?].

Obviously, the element-at-a-time nature of these native IR algorithms for
top-N query processing significantly reduces the number of possible query plans
under consideration in the query processor of a DBMS combining IR techniques
with data retrieval, which is unacceptable in many cases. So, the aim of this
paper is to devise a database approach that satisfies the following goals:

— Improving the efficiency and scalability using (a) domain knowledge and (b) 
new techniques inspired by the (element-at-a-time) cut-off operations, and, 

— Maintaining the flexibility and generality of the database approach to IR. 

Summarizing, we have identified two potential approaches to improve effi- 
ciency in information retrieval query processing in a database environment: on 
the one hand, we may reduce the amount of work by ranking fewer documents, 
and on the other hand, we may take advantage of computing only partial answers 
in the first iterations of the retrieval process. In the remainder of this paper, we 
will demonstrate how the database approach allows us to exploit fragmenta- 
tion of the metadata to achieve these ideas in a simple yet effective manner, 
requiring only minimal changes to the original, declarative specification of the 
IR retrieval model. 



3 IR Query Processing in the MirrorDBMS 

The architecture of the MirrorDBMS, our research prototype, consists of two
layers: the logical layer, based on the Moa object algebra [?], and a physical layer,
realized by the binary relational main-memory DBMS Monet [?]. The query
processor transforms the algebraic query expressions specified in Moa into the -
highly efficient - physical operators offered by Monet. The distinguishing feature
of the MirrorDBMS is that it is extensible at all levels of its architecture, enabling
the encoding of domain knowledge and advanced query processing techniques at
the logical level as well as the physical level (please refer to [?, Chapter 2] for more
information). Also, the prototype DBMS is well prepared for scalability, as Moa
supports shared-nothing parallelism, and shared-memory parallel computing is
supported by Monet at the physical level.



3.1 The IR Retrieval Model 

A retrieval model specifies how the similarity between a document and the query 
is computed, given the query and the relevance feedback from one or more pre- 
vious iterations. Most probabilistic IR models rank the documents based on two 
parameters: 

term frequency: For each pair of term and document, tf is the number of
times the term occurs in the document.

inverse document frequency: For each term, idf is the inverse of the number of
documents in which the term occurs.

In a more database-like notation we can describe these two statistics as the
relations (1) and (2) respectively, as shown in Figure 1.

Furthermore, we introduce the set of query terms, i.e. the unary relation (3)
shown in Figure 1.

    TF(term, document, tf)    (1)
    IDF(term, idf)            (2)
    Q(term)                   (3)

Fig. 1. Relations

In most probabilistic retrieval models, the ranking of a document given a
query is almost completely determined by the sum of the products of the
tf and idf of the query terms occurring in the document, sometimes normalized
by the document length in one way or another. See [?] for more detailed
information about our specific ranking formula and retrieval model.
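
As a small worked example of this sum of tf · idf products, the sketch below ranks documents from toy TF and IDF relations. It is not the paper's ranking formula: document-length normalization is omitted, the function name `rank_documents` is hypothetical, and the sample data is invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def rank_documents(tf: Iterable[Tuple[str, str, float]],
                   idf: Dict[str, float],
                   query: Set[str]) -> Dict[str, float]:
    """Score each document as the sum of tf * idf over the query terms it contains."""
    scores: Dict[str, float] = defaultdict(float)
    for term, document, freq in tf:
        if term in query and term in idf:
            scores[document] += freq * idf[term]
    return dict(scores)


# Toy relations: TF(term, document, tf) and IDF(term, idf), with idf = 1 / document count.
TF = [("grid", "d1", 2), ("data", "d1", 1), ("data", "d2", 3), ("web", "d2", 1)]
IDF = {"grid": 1.0, "data": 0.5, "web": 1.0}

print(rank_documents(TF, IDF, {"grid", "data"}))
# d1: 2 * 1.0 + 1 * 0.5 = 2.5, d2: 3 * 0.5 = 1.5
```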



3.2 Query Processing in an IR System 

The algorithm shown in pseudo-code in Figure 2, consisting of three parts,
sketches how a typical IR system computes its query results from these tables in
a nested-loop manner, thus determining precisely the physical execution order.
We will assume that IDF is ordered descending on idf.

Clearly, this algorithm allows a very efficient cut-off, but it is also
highly inflexible with respect to execution order.



3.3 Set-Oriented IR Query Processing 

To reduce this inflexibility of the IR retrieval process, and thus enable smooth
integration with traditional DBMS query processing, we reformulate this
algorithm at a higher, declarative level.²

In the implementation, the information retrieval techniques are supported by
extensions at both levels of the MirrorDBMS. While the exact ranking formula
requires some minimal extensions at the physical level, the set-oriented
formulation of IR query processing is almost completely modeled at the logical level as
a Moa extension. Although the real algorithm is specified using the powerful but
relatively low-level Monet Interface Language (MIL), we prefer an SQL-flavored
syntax for didactic reasons.

Again, the algorithm, as shown in Figure 3, consists of three parts.

² For the impatient: the usefulness of this seemingly minor step will become
clearer when we present our optimization techniques based on data fragmentation in
Section 4.
Part A Limit the TF and IDF to match the terms in query Q:

    foreach t1 in TF do
        if (t1.term in Q) then
            INSERT t1 INTO TFQ
        endif
    end

    foreach t1 in IDF do
        if (t1.term in Q) then
            INSERT t1 INTO IDFQ
        endif
    end

where t1 denotes the tuple-variable associated to the relations.

Part B Loop over the terms to compute the ranking contribution per document-term pair, and
update the document ranking incrementally each time a new ranking contribution for that
document becomes available. Stop as soon as a test (based on the processed IDFQ and TFQ
values, knowing that IDFQ can only decrease, cf. [?]) shows that no document could obtain
a ranking better than the current top-N.

Like before, t1 and t2 are tuple-variables. Furthermore we assume the existence of a table
RANK that has two columns: document and rank.

    foreach t1 in IDFQ do
        # Find the matching tf values
        TFQsel = findrecords(TFQ, t1.term)

        foreach t2 in TFQsel do
            tfidf = t1.idf * t2.tf;
            if (t2.document in RANK) then
                updateranking(RANK, t2.document, tfidf)
            else
                addranking(RANK, t2.document, tfidf)
            endif

            # topN test criterion
            if (!topNcanimprove) then
                exitloops
            endif
        end
    end

Part C Return the top ranking documents:

    i = 0
    foreach t1 in RANK do
        if (i < N) then
            INSERT t1 INTO TOPRANK
        else
            exitloops
        endif
        i = i + 1
    end

Fig. 2. IR query evaluation algorithm
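
For readers who want to run the idea in Fig. 2, here is a hedged Python sketch. The figure leaves the cut-off test unspecified, so the bound used here is an assumption: it stops once no document outside the current top N can still overtake the N-th score, and it is evaluated after each term rather than after each tuple. The function name and sample data are hypothetical.

```python
import heapq
from collections import defaultdict


def top_n_element_at_a_time(tf, idf, query, n):
    """Process query terms in descending idf order and cut off early when possible."""
    # Part A: restrict TF and IDF to the query terms.
    tfq = defaultdict(list)
    for term, doc, freq in tf:
        if term in query:
            tfq[term].append((doc, freq))
    idfq = sorted(((t, idf[t]) for t in query if t in idf),
                  key=lambda pair: pair[1], reverse=True)

    # Assumed bound: the most a term can still add to any single document.
    max_gain = [w * max((f for _, f in tfq[t]), default=0) for t, w in idfq]

    # Part B: accumulate ranking contributions term by term.
    rank = defaultdict(float)
    terms_processed = 0
    for i, (term, weight) in enumerate(idfq):
        for doc, freq in tfq[term]:
            rank[doc] += weight * freq
        terms_processed = i + 1
        remaining = sum(max_gain[i + 1:])
        best = heapq.nlargest(n + 1, rank.values())
        if len(best) >= n:
            runner_up = best[n] if len(best) > n else 0.0
            if runner_up + remaining <= best[n - 1]:
                break  # the membership of the top N can no longer change

    # Part C: return the top-ranking documents.
    return heapq.nlargest(n, rank.items(), key=lambda kv: kv[1]), terms_processed


TF = [("rare", "d1", 3), ("rare", "d2", 1),
      ("common", "d1", 1), ("common", "d2", 1), ("common", "d3", 1)]
IDF = {"rare": 1.0, "common": 0.25}
print(top_n_element_at_a_time(TF, IDF, {"rare", "common"}, 1))
```

With N = 1 the loop stops after the first term, so the returned score omits the contribution of the skipped term. The set of returned documents is the same as for a full evaluation, but the scores, and with them the order within the top N, may differ.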
Part A Do some initialization given query Q.

Part B Limit the TF and IDF to match the terms in query Q:

    $TF_Q = TF \ltimes Q$

and

    $IDF_Q = IDF \ltimes Q$.

Next, place the $IDF_Q$ values next to the corresponding entries in $TF_Q$:

    $TFIDF_{lineup} = TF_Q \bowtie IDF_Q$.

Now, compute the tf · idf value per term-document pair, aggregating the last two columns
into one:

    TFIDF = SELECT term, document, tf * idf AS tfidf
            FROM TFIDF_lineup.

Finally, compute the ranking per document by aggregating all term contributions per
document:

    RANK = SELECT document, AGGR(tfidf)
           FROM TFIDF
           GROUP BY document.

Please note that in the actual code, the AGGR()-operator does not exist as one operator,
but denotes a combination of several functions that together compute the ranking. We
abbreviated it here for reasons of simplicity.

Part C Normalize RANK and select the top-N documents:

    TOPRANK = TOP(RANK, N).

Fig. 3. Set-oriented IR query processing
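
The set-oriented formulation of Fig. 3 can be tried out with an ordinary SQL engine. The sketch below uses Python's built-in `sqlite3` module rather than Moa or MIL, stands in `SUM` for the `AGGR()` operator, and skips normalization; those substitutions, the function name, and the sample rows are assumptions made for illustration.

```python
import sqlite3


def set_oriented_top_n(tf_rows, idf_rows, query_terms, n):
    """Run the Fig. 3 pipeline as one declarative query; SUM stands in for AGGR()."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE TF(term TEXT, document TEXT, tf REAL);
        CREATE TABLE IDF(term TEXT, idf REAL);
        CREATE TABLE Q(term TEXT);
    """)
    con.executemany("INSERT INTO TF VALUES (?, ?, ?)", tf_rows)
    con.executemany("INSERT INTO IDF VALUES (?, ?)", idf_rows)
    con.executemany("INSERT INTO Q VALUES (?)", [(t,) for t in query_terms])
    rows = con.execute("""
        WITH TFQ AS (SELECT * FROM TF WHERE term IN (SELECT term FROM Q)),   -- semijoin
             IDFQ AS (SELECT * FROM IDF WHERE term IN (SELECT term FROM Q)), -- semijoin
             TFIDF AS (SELECT TFQ.term, document, tf * idf AS tfidf
                       FROM TFQ JOIN IDFQ ON TFQ.term = IDFQ.term)
        SELECT document, SUM(tfidf) AS score
        FROM TFIDF
        GROUP BY document
        ORDER BY score DESC, document
        LIMIT ?
    """, (n,)).fetchall()
    con.close()
    return rows


TF_ROWS = [("grid", "d1", 2), ("data", "d1", 1), ("data", "d2", 3), ("web", "d2", 1)]
IDF_ROWS = [("grid", 1.0), ("data", 0.5), ("web", 1.0)]
print(set_oriented_top_n(TF_ROWS, IDF_ROWS, ["grid", "data"], 2))
```

Because the whole computation is a single declarative statement, the engine's optimizer is free to choose the join order and access paths, which is the flexibility the element-at-a-time version gives up.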



3.4 Discussion 

The main performance bottleneck lies in handling the TF table³: TF contains
over 26 million entries in the experiments performed for this paper - which
is only about a quarter of the complete TREC data set. Only Part B⁴ in the
algorithm described above handles a very large amount of data. At first glance,
the pre-selection on query terms, $TF_Q = TF \ltimes Q$, at the beginning of Part B,
may potentially reduce the remaining dataset to a manageable size. However, it is
a well-known fact in IR experiments that for the average query, roughly
half of the database remains after that pre-selection: still a very large dataset as
input for the subsequent computations.

Since N = 1000 or (often) less, pushing the top-N operator into the query
could be very profitable. However, pushing it down the query plan implies
pushing it through the AGGR()-operator, and therefore through the tf · idf product.

³ Since we work on a binary model, several tables together represent the columns of
TF. However, for didactic reasons we will stick to the normal table approach since
this has no significant consequences for the core of our idea.

⁴ From now on, when we refer to Part A, Part B, or Part C, we mean the ones
described in the database approach and not in the IR approach as described in
Subsection 3.2.
A generic (set-oriented) mathematical solution for this top-N query
optimization problem is not a trivial one, despite its innocent look. In the next section,
we therefore propose data fragmentation as another means to prune the search,
while keeping (the declarative specification of) the algorithm practically
untouched.

4 Data Fragmentation and the Top-N Query Optimization Problem

Since Monet is a main-memory DBMS, the data used in the hot set should 
always fit in main-memory (to avoid performance degradation due to swapping, 
or even worse, a crash caused by running out of memory). The natural way 
to meet this requirement is to horizontally fragment the TF table into a small 
(yet to be determined) number of suitably sized parts. Such a fragmentation 
strategy is orthogonal to the actual retrieval algorithm, and can be managed in 
the mapping from Moa to MIL. 

In this paper, we elaborate on the use of additional knowledge for choosing the
fragmentation scheme, specific to query processing in the IR domain. We show
how this enables us to achieve both proposed strategies to improve the efficiency
of IR query processing: (1) computing partial answers, and (2) top-N query
optimization. Furthermore, the implementation of the fragmentation strategy
remains almost entirely orthogonal to the IR retrieval algorithm outlined before.
Note that we will focus on optimization techniques at the logical level of the
MirrorDBMS.




Fig. 4. Relative document frequency (zoomed on y-axis to show lower values) 
Restricting query processing to a smaller portion of the metadata is a well-known
approach to increase the efficiency of IR system implementations by
computing approximate answers. Obviously, this implies that the effectiveness of the
answer (measured using precision/recall) will degrade: we trade quality for speed.
To minimize the loss of quality, we exploit the properties of the afore-mentioned
Zipfian term distribution. The hyperbolic curvature of the document frequency
plot, shown in Figure 4, confirms that the data in our test database (see also
Section 5) indeed behaves as predicted by Zipf, validating the
reasoning behind our approach.

4.1 The Fragmentation Algorithm 

Figure 5 shows the simplified basics of our fragmentation algorithm for
splitting the data up into two fragments using the additional information about
the term distribution. To keep the example simple, we take the first fragment
such that it contains $s_1 \cdot |IDF|$ of the terms and the second one the other
$s_2 \cdot |IDF|$ terms, where $s_2 = (1 - s_1)$ and $0 < s_i < 1$, $i \in \{1, 2\}$.



Step 1 Sort the IDF descending on the idf values, i.e., terms that occur in many documents
get lower in the list compared to terms that occur in fewer documents.

    IDF_sorted = SELECT *
                 FROM IDF
                 ORDER BY idf DESC

Step 2 Create two fragments $IDF_1$ and $IDF_2$, taken in order from IDF_sorted, such that

    (SELECT COUNT(*) FROM IDF_i) = (SELECT s_i * COUNT(*) FROM IDF_sorted)

for $i \in \{1, 2\}$.

Step 3 Create two fragments $TF_1$ and $TF_2$ such that

    $TF_i = TF \ltimes IDF_i$

for $i \in \{1, 2\}$.

Notice that, for $s_1 = 0.95$, $TF_1$ would now contain approximately 5% (not 95%!) of the
tuples of TF, and $TF_2$ the rest, due to the high skewedness of the data.

Fig. 5. Fragmentation algorithm
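
The three steps of Fig. 5 can be followed in a few lines of Python. This is a sketch under assumptions: the relations are held as in-memory Python structures rather than Monet tables, the split point is rounded to the nearest whole term, and the function name `fragment` and the toy data are invented for illustration.

```python
def fragment(idf, tf, s1):
    """Split IDF and TF into two fragments; fragment 1 holds the s1 share of highest-idf terms."""
    if not 0 < s1 < 1:
        raise ValueError("s1 must lie strictly between 0 and 1")
    # Step 1: sort the terms descending on idf.
    idf_sorted = sorted(idf.items(), key=lambda pair: pair[1], reverse=True)
    # Step 2: cut the sorted list after s1 * |IDF| terms (rounding is an assumption).
    cut = round(s1 * len(idf_sorted))
    idf1, idf2 = dict(idf_sorted[:cut]), dict(idf_sorted[cut:])
    # Step 3: semijoin TF with each IDF fragment.
    tf1 = [row for row in tf if row[0] in idf1]
    tf2 = [row for row in tf if row[0] in idf2]
    return (idf1, tf1), (idf2, tf2)


# Toy collection of ten documents: three rare terms and one very frequent term.
docs = [f"d{i}" for i in range(10)]
TF = ([("a", "d0", 1)]
      + [("b", d, 1) for d in docs[:2]]
      + [("c", d, 1) for d in docs[2:4]]
      + [("the", d, 1) for d in docs])
IDF = {"a": 1.0, "b": 0.5, "c": 0.5, "the": 0.1}

(idf1, tf1), (idf2, tf2) = fragment(IDF, TF, 0.75)
print(len(idf1), len(tf1), len(tf2))
```

Even in this tiny example the skew is visible: three quarters of the terms end up in the first fragment, yet they account for only a third of the TF tuples.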



Since the terms in $IDF_1$ have a high idf, their contribution to the ranking
of a document is likely to be higher than for terms in $IDF_2$ (having lower idf
values). In other words, the terms in $IDF_1$ are a priori more promising than
the terms in $IDF_2$. Fortunately, these interesting terms only use about 5% of
the data (in case $s_1 = 0.95$). So, in case all query terms are stored in the first
fragment, we only need to compute the results using $IDF_1$ and $TF_1$. This would
mean that the semijoin $TF_Q = TF \ltimes Q$ in Part B of the algorithm
would become $TF_Q = TF_1 \ltimes Q$, which will be significantly faster due to the
much smaller first operand.

In case not all query terms are contained in the first fragment, one might
decide to still compute the results on the first fragment only. This could of
course result in a different top-N when too much significant information is ignored
that way. Some experiments described later in this paper try to determine the
effects of ignoring the second fragment on the quality of the answer.

Finally, notice that the fragmentation algorithm described above can easily 
be used to handle different fragmentations, for instance for different relative 
sizes (just choose a different $s_i$) and/or more than two fragments (have $i \in S$, 
with $S \subseteq \{1, 2, 3, \ldots, M\}$, $|S| > 2$, $M = |IDF|$). For reasons of 
simplicity, it is sometimes more practical to join TF and IDF before fragmenting 
the data, and propagate the fragmentation into TF and IDF fragments subsequently. 
This other method is particularly handy to obtain fragments of (almost) equal data 
size (see footnote). 

4.2 Fragment-Based IR Query Processing with Top-N Cut-off 

The algorithm in Figure 6 shows the top-N cut-off idea in a manner similar to 
the set-based description of the retrieval algorithm given earlier, exploiting 
the fragmentation idea described above. 

This algorithm is in fact a subset-at-a-time version of the element-at-a-time 
version described earlier. 

Unsafe Top-N Optimization. Note that this algorithm is a so-called unsafe 
top-N cut-off algorithm. Top-N query optimization relies on the cut-off of 
the query evaluation at a certain stage when certain characteristics concerning 
the still remaining work (see footnote) provide sufficient evidence that the 
top-N cannot be improved anymore. However, this also means that at the cut-off 
moment certain information, e.g. ranking contributions, has not been taken into 
account. This usually results in a top-N containing the correct documents but 
with an incomplete ranking. In turn, this can result in a different ordering of 
the top-N. Unsafe top-N query optimization stops at this ‘incorrectly’ ordered 
top-N. 

Safe Top-N Optimization. The safe alternative to the unsafe method does 
indeed return the top-N with the correct ranking values and inherently can 
deliver them in the correct order. To obtain these final ranking values the ranking 
contribution for the documents in the unsafe top-N needs to be computed for 
all fragments that have not been taken into account, yet. This of course will 
(slightly) reduce the profit of top-N cut-off due to the extra work that has to be 
done. 
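
The extra work of the safe method can be illustrated with a small sketch. It assumes that ranking contributions are additive per document and that each not yet processed fragment can be represented as a mapping from document to contribution; both the function name `make_safe` and these data structures are hypothetical and only serve the illustration.

```python
def make_safe(unsafe_top, remaining_fragments):
    """Complete the scores of an unsafe top-N and re-sort it.

    unsafe_top:          list of (doc_id, score) pairs from the unsafe run
    remaining_fragments: iterable of dicts doc_id -> ranking contribution,
                         one per fragment that was skipped by the cut-off
    Assumption: contributions are simply added to the existing score.
    """
    scores = dict(unsafe_top)
    for fragment in remaining_fragments:
        # Only documents already in the top-N are completed; other
        # documents in the fragment are deliberately ignored.
        for doc_id in scores:
            scores[doc_id] += fragment.get(doc_id, 0.0)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Only the documents already present in the unsafe top-N are touched, which is why the additional cost stays small compared to a full evaluation of the skipped fragments.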

Footnote (on obtaining fragments of almost equal data size): The fragmentation 
process itself is part of the physical design of the database, and therefore its 
performance is not really an issue, at least for mostly static collections. 

Footnote (on the still remaining work): In the algorithm described here the 
topNcanimprove variable represents the information needed to make the cut-off 
decision. 

Part A Similar to Part A of the set-based retrieval algorithm. Set i to the first 
fragment that contains a query term. 

Part B Similar to Part B of the set-based retrieval algorithm, this time using 
fragment i instead of the unfragmented TF and IDF. Let’s call the resulting 
ranking RANK_i. 

Part B' Merge RANK_i into any existing RANK or otherwise set RANK = RANK_i. 

Part C Normalize RANK and select the top-N documents: 

    TOPRANK ← TOP(RANK, N). 

Part C' Compute the lowest ranking in the current intermediate top-N: 

    topLB ← MIN(TOPRANK) 

and the highest ranking in the remaining intermediate results: 

    restUB ← MAX(RANK − TOPRANK). 

Furthermore, compute the highest possible ranking contributions over all fragments j > i: 

    contribUB ← MAXRCONTRIB(TFIDF_j, ..., TFIDF_25) 

and the lowest possible ranking contribution 

    contribLB ← MINRCONTRIB(TFIDF_j, ..., TFIDF_25). 

Part C'' Test whether the top-N still can be improved: 

    topNcanimprove = (restUB + contribUB > topLB + contribLB) 

and limit RANK to those documents that still can move up into the top-N: 

    RANK ← SELECT * FROM RANK 
           WHERE rank > topLB + contribLB − contribUB 

as soon as: 

    MIN(RANK) < topLB + contribLB − contribUB 

and COUNT(TOPRANK) ≥ N, and limit all fragments j > i to match this new intermediate 
ranking. 

Part C''' If topNcanimprove is true, then find the next fragment i containing a query term 
and return to Part B (in this algorithm). Otherwise, return TOPRANK and quit. 



Fig. 6. Fragment-based IR query processing with top-N cut-off 
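
The bound computations of Parts C to C'' can be made concrete with a small sketch. It is a simplified illustration rather than the authors' implementation: normalization is omitted, the function name `cutoff_step` and the dictionary representation of RANK are hypothetical, a missing remainder is treated as having score 0.0, and the documents of the current top-N are always kept when RANK is restricted. The last two points are assumptions added here to keep the sketch well defined.

```python
def cutoff_step(rank, n, contrib_ub, contrib_lb):
    """One pass of Parts C to C'' of Fig. 6 on an intermediate ranking.

    rank:       dict doc_id -> score accumulated over fragments 1..i
    n:          size of the requested top-N (n >= 1)
    contrib_ub: highest score a document can still gain from fragments j > i
    contrib_lb: lowest score a document can still gain from fragments j > i
    Returns (toprank, top_n_can_improve, restricted_rank).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    ordered = sorted(rank.items(), key=lambda kv: kv[1], reverse=True)
    toprank, rest = ordered[:n], ordered[n:]      # Part C: TOP(RANK, N)
    if len(toprank) < n:
        # Fewer than N documents ranked so far: no cut-off, no restriction.
        return toprank, True, dict(rank)
    top_lb = toprank[-1][1]                       # Part C': MIN(TOPRANK)
    rest_ub = rest[0][1] if rest else 0.0         # MAX(RANK - TOPRANK)
    can_improve = rest_ub + contrib_ub > top_lb + contrib_lb
    # Part C'': drop documents that can no longer reach the top-N.
    threshold = top_lb + contrib_lb - contrib_ub
    top_ids = {doc_id for doc_id, _ in toprank}
    restricted = {d: s for d, s in rank.items()
                  if s > threshold or d in top_ids}
    return toprank, can_improve, restricted
```

A caller would invoke this after merging each fragment and stop as soon as the second return value is false, continuing with the restricted ranking otherwise.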



Heuristic Unsafe Top-N Optimization. Going even further on the unsafe 
principle, we can drop the requirement in Part C'' that the intermediate rank 
can only be restricted when 

    MIN(RANK) < topLB + contribLB − contribUB. 

The algorithm then becomes even ‘more’ unsafe: as soon as 
COUNT(TOPRANK) ≥ N, documents that have no ranking yet are ignored, even when 
they would have received a high ranking otherwise. In turn, this heuristic unsafe 
method is very likely to achieve a much better performance due to the earlier 
and more restrictive limitation imposed on RANK and all fragments j > i. The 
‘level of unsafe-ness’ can be controlled by adding some documents (with initial 
ranking 0.0) to RANK already during Part A using an a priori notion of ranking 
between the documents. These documents cannot be forgotten anymore, but 
will keep their 0.0 ranking when they do not contain any query terms, thus not 
disrupting the ranking process in case they were wrongly added in advance. 

In our case we control the level of unsafe-ness using a factor $l$ (where 
$0.0 \le l \le 1.0$) to select the $l \times$ (number of documents) documents with 
the highest document length to be added in advance. The document length appeared 
to be an interesting, natural measure of a priori document relevance for the IR 
model we used. However, one can think of many other means to ‘pre-select’ 
documents that should not be ignored (e.g. the documents that are most referenced 
in a digital library case, or most linked to in the web case). Also note that a 
too high $l$ will cause the performance to drop rapidly because of the then 
extremely high number of documents that are forced to be ranked. 
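
The pre-selection step can be sketched as follows. The function name `preselect`, the dictionary of document lengths, and the choice to round the number of selected documents down are assumptions made for this illustration; the text above only fixes that the longest documents are added with an initial ranking of 0.0.

```python
import math


def preselect(doc_lengths, l):
    """Return the initial RANK entries for the heuristic unsafe method.

    doc_lengths: dict doc_id -> document length (hypothetical stand-in)
    l:           level of unsafe-ness, 0.0 <= l <= 1.0
    The l * |documents| longest documents get an initial ranking of 0.0.
    """
    if not 0.0 <= l <= 1.0:
        raise ValueError("l must lie between 0.0 and 1.0")
    k = math.floor(l * len(doc_lengths))          # rounding down is assumed
    longest = sorted(doc_lengths, key=doc_lengths.get, reverse=True)[:k]
    return {doc_id: 0.0 for doc_id in longest}
```

With `l = 0.0` nothing is pre-selected, and with `l = 1.0` every document is forced into RANK, which matches the two extremes discussed in the text.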

5 Experimental Setup 

In the experimental evaluation of the ideas put forward in the previous section, 
we focus on the following three concrete research questions: 

1. How can fragmentation improve efficiency for top-N query execution? 

a) What are the consequences for the speed? 

b) What are the consequences for the quality of the query results, also 
taking into account the impact of safe/unsafe top-N optimization? 

2. How can fragmentation improve scalability, to either manage the same 
database on smaller hardware (like a notebook), or a larger database on 
the same hardware (such as a search engine for the WWW)? 

5.1 Data Set and Evaluation Measures 

The experiments are performed on the Financial Times (FT) collection, a major 
subset of the TREC data set, using the 50 topics (i.e. queries) and relevance 
judgments used in TREC-6. Since we want to investigate the trade-off between 
quality and speed we need a good benchmark for the precision and recall. The 
TREC relevance judgments we use are the most widely accepted retrieval quality 
benchmarks. Also, the FT document collection is sufficiently large to show the 
important effects. 

We defined four series of experiments, which we evaluate using the measures 
described in Figure 7. 

5.2 Overview and Motivation of Experiments 

Here, we discuss the four series of experiments that we performed: 

Series I: Baseline. The first series of experiments is meant to show the quality 
and performance of our system without any special tricks: the Monet DBMS 
determines whether to build any access structures (usually hash tables) to speed 
up certain operations (for instance: joins). In this version the main focus was on 
the quality of the retrieval results and flexibility of the retrieval model. The effort 
to optimize this system for performance did not go beyond the usual exploitation 
of certain alignment issues important in main memory computing. 

Average Precision We define the average precision (AP) as: 

    AP = avg(p_i) 

where p_i is the average 11-point-precision for each query i, i ∈ {1, 2, 3, ..., 50}. So, the AP 
measure is actually the average average 11-point-precision. 

Average Retrieved Relevant We define the average retrieved relevant (ARR) as: 

    ARR = avg(r_i) 

where r_i is the number of relevant documents retrieved for each query i, i ∈ {1, 2, 3, ..., 50}. 
Notice that the well-known recall measure is defined as r_i divided by the total number of 
relevant documents for query i. 

Average Execution Time The average execution time (AET) is defined as: 

    AET = avg(t_i) 

where t_i is the (wall clock) execution time measured for each query i, and 
i ∈ {1, 2, 3, ..., 50}. 



Fig. 7. Evaluation measures 
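
The per-query quantities behind these measures can be computed as sketched below. The sketch assumes the common interpolated definition of 11-point precision (the highest precision at any recall level at or above 0.0, 0.1, ..., 1.0), which the figure itself does not spell out; the function names are hypothetical.

```python
def eleven_point_precision(ranked, relevant):
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0 (p_i).

    ranked:   list of doc ids in ranking order for one query
    relevant: non-empty set of doc ids judged relevant for that query
    """
    if not relevant:
        raise ValueError("at least one relevant document is required")
    total = len(relevant)
    hits = 0
    points = []                       # (hits so far, precision at that hit)
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits, hits / position))
    interpolated = []
    for level in range(11):           # recall level = level / 10
        # Integer comparison avoids floating point issues: recall >= level/10.
        candidates = [p for h, p in points if h * 10 >= level * total]
        interpolated.append(max(candidates, default=0.0))
    return sum(interpolated) / 11


def retrieved_relevant(ranked, relevant):
    """Number of relevant documents retrieved for one query (r_i)."""
    return len(set(ranked) & set(relevant))


def average(values):
    """avg(.) as used for AP, ARR, and AET."""
    return sum(values) / len(values)
```

AP, ARR, and AET are then obtained by applying `average` to the per-query values of `eleven_point_precision`, `retrieved_relevant`, and the measured wall clock times, respectively.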



Series II: Speed/Quality Trade-off. In the second series of experiments, we 
concentrate on the effects of ignoring data on the trade-off between quality and 
efficiency. We defined two variants of this series of experiments: 

(a) Always use the first fragment (and forget about the second fragment). 

(b) Take the first fragment, unless 

    $IDF_1 \ltimes Q = \emptyset$. 

We used a term-fragment index to allow an efficient choice, instead of using 
just this semijoin. 
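
The role of the term-fragment index in variant (b) can be illustrated with a minimal sketch. The index is modelled here as a plain dictionary from term to fragment number; this representation and the function names are assumptions for the illustration, since the text does not describe how the index is stored.

```python
def build_term_fragment_index(fragments):
    """Map each term to the number of the fragment that stores it.

    fragments: list of sets of terms; position 0 is the first fragment.
    """
    return {term: number
            for number, terms in enumerate(fragments)
            for term in terms}


def choose_fragment(query_terms, index):
    """Variant (b): use the first fragment unless it holds no query term."""
    if any(index.get(term) == 0 for term in query_terms):
        return 0
    return 1
```

A lookup per query term replaces the semijoin between the first IDF fragment and the query, so the decision costs time proportional to the query length only.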

Both types of experiments are executed for several different fragmentations, 
where the relative size of the first fragment, in number of terms, varies from 
90% to 99.9%, using the fragmentation algorithm described above. 

Obviously, the (a) series can result in loss of quality; if none of the query terms 
have a high idf value, the answer set has been reduced to a random sample from 
the collection. The (b) series is meant to reduce this negative effect. 

We expect that the speed will decrease in favor of the quality with increasing 
first fragment size. The second experiment should be slower, since the evaluation 
of the second fragment triggered by some of the queries will increase the 
execution time considerably, though resulting in a better quality than for series 
(a). 



Series III: Benefits of Fragmenting. Series II focuses on the trade-off between 
ignoring data to obtain speed and the quality of the resulting answers. However, 
the second fragment is still quite large in terms of data size, impeding 
main-memory execution. The experiments in Series III are mainly intended to 
investigate the effects of executing our query algorithm on relatively small 
fragments. To do so, we fragment our database into 25 smaller fragments of 
equal data size. The number of fragments was experimentally determined based 
on two constraints: there should be sufficiently many fragments to demonstrate 
the expected behavior, but each fragment should still be reasonably large so as to 
obtain the advantage of set-oriented processing. 

Again, we perform a couple of variants of these experiments: 

(a) This variant studies the effects of the fragmentation procedure described in 
Section 4.1 on execution time and quality of results. As in Series IIb, we use a 
term-fragment index to efficiently determine whether a fragment should be 
evaluated or not. 

(b) This variant uses the same fragmentation as for (a), but this time we allow 
query evaluation to be cut off after each fragment. The choice whether to 
stop processing the query (and after which fragment) is based on estimates 
whether the top-N can still be improved by processing any of the following 
fragments. This strategy uses the computed lower and upper bounds to restrict 
the intermediate ranking to those documents that may still move into the 
top-N, thus limiting the computational effort needed for any successive 
fragments still to be evaluated. We evaluate both the safe and (normal) unsafe 
cut-off principle in this variant. 

(c) As described in Section 4.2, one can relax certain conditions for the unsafe 
algorithm, obtaining what we call a heuristic unsafe method. This variant 
performs the heuristic version of the unsafe experiments done for the (b) 
variant, taking $l \in \{0.00, 0.05, 0.10, 0.25, 1.00\}$. As explained before we 
used $l$ to pre-select the fraction of a priori most interesting documents (see 
footnote) that should not be ignored in case of intermediate result restriction 
in Part C''. 

Since variant IIIa takes into account all query terms, we expect the AP and 
ARR to be equal to the figures measured in Series I. The AET will probably be 
better (i.e. lower) than for Series IIb, since the overhead occurring from using 
an extra fragment is likely to be lower (since the fragments are smaller). 

Of course, the computation of the estimates in Series IIIb introduces an 
overhead in execution costs; this investment only pays off if the profits of the 
optimization are high enough. The (b) variant of this series is meant to show 
whether this is still the case when applied to subset-at-a-time processing rather 
than the element-at-a-time case studied previously. Note that in that earlier 
study the results for the safe method showed no real significant performance 
improvement. 

As mentioned before, the quality is likely to be somewhat lower in case of 
the unsafe variant of the cut-off. 

We expect the IIIc variant to outperform IIIa and IIIb (both for safe and 
unsafe runs) by far for $l = 0.00$ and quality to be lower but not really bad. For 
growing $l$ we expect the performance to degrade rapidly since the overhead will 
grow significantly. However, in the case of $l = 1.00$ the quality should reach the 
same levels as measured for the IIIa variant. 



Footnote (on the a priori most interesting documents): We used the document 
length, which appeared as a natural candidate given the IR model we used. As 
stated before, other measures might be more appropriate in other environments. 






H.E. Blok et al. 



Series IV: Influence of Query Length and Top-N Size. Because we calibrate 
the quality measurements using the relevance judgments of the TREC-6 
queries, the experiments in Series III have been performed with fairly long 
queries: an average length of 27 terms, and the longest query contains over 60 
terms. Also, TREC evaluation requires the top 1000 to be produced for each 
query. 

Series IV tries to provide insight into (a) the effect of query length on the AET 
and (b) the effect of the size of the required top-N on the AET. The (a) variant 
repeats the experiments of Series I (unfragmented case), IIIa (25 fragments, no 
top-N cut-off), IIIb (25 fragments with normal safe/unsafe top-N cut-off), and 
IIIc (25 fragments with heuristic unsafe top-N cut-off) for limited query lengths. 
The new queries are constructed by taking the first k terms of each original query 
(or the entire original query in case it was shorter than k terms). We let k range 
from 1 to 25. The (b) variant leaves the queries untouched and computes Series 
I, IIIa, IIIb, and IIIc for several top-N sizes ranging from 10 to the original 1000. 
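
The construction of the shortened queries for the (a) variant amounts to a simple prefix operation, sketched below with a hypothetical helper name and queries represented as lists of terms.

```python
def truncate_queries(queries, k):
    """Keep the first k terms of each query; shorter queries stay intact.

    queries: list of queries, each a list of terms in original order
    k:       maximum query length, k >= 1
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [query[:k] for query in queries]
```

Running the experiments for k from 1 to 25 then corresponds to calling this helper once per value of k on the original 50 topics.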

We expect that the (a) variant will show a relative performance advantage 
in favor of the top-N cut-off for longer queries compared to the cases without 
top-N cut-off. For shorter queries, fragmentation alone will already result in 
quite efficient processing whereas top-N cut-off would only cause extra 
computational costs without much chance to gain a profit. The (b) variant is 
expected to demonstrate better AET values for lower N. The shorter the required 
top-N, the higher the lowest ranking value topLB occurring in the top-N; and, the 
higher topLB, the higher the value of topLB + contribLB − contribUB used 
to restrict the intermediate result RANK. This means that more elements in 
RANK are likely to be cut away. Also, the higher ranking values usually tend 
to be further away from their closest neighbors. So, the higher the topLB, the 
higher the chances that restUB will be so much further away (i.e. lower) that 
the gap cannot be bridged anymore, thus allowing for a top-N cut-off. In both 
cases the execution time is reduced, either due to less computational load per 
fragment or fewer fragments being evaluated. 

6 Experimental Results 

The hardware platform used to produce the results presented in this section is a 
dedicated PC running Linux 2.2.14, with two Pentium III 600 MHz CPUs (of 
which only one was actually used in the experiments), 1 GB of main-memory, 
and a 100 GB disk array mounted in RAID 0 (striping) mode; no other user 
processes were allowed on the system while running the experiments. 

The remainder of this section is divided in four parts, corresponding with the 
four series of experiments. For each of these series of experiments, we included 
some figures/tables to illustrate the results. 

6.1 Series I: Baseline 

In this subsection we present the AP, ARR, and AET for the baseline run of 
the retrieval experiments. Table 1 shows the measured values, next to the values 
provided by TREC as the benchmark. The AET of course is not available for 
the benchmark. 



Experiences with IR TOP N Optimization in a Main Memory DBMS 






Table 1. [Series I] Baseline result statistics (TREC benchmark included for 
comparison) 

               AP (%)    ARR     AET (s) 
Benchmark      100.0     31.8    - 
Series I        31.0     22.9    44.4 



The ARR of 22.9 means that the recall of our unfragmented run is 

$$\frac{ARR_{unfragmented}}{\text{avg. actual no. relevant}} = \frac{22.9}{31.8} \approx 0.72.$$

Given that many IR systems stay below 30% precision, reaching an AP of 31.0% 
next to this fairly high recall demonstrates that we indeed used a state-of-the-art 
IR model. 

The AET of 44.4 seconds is of course not very competitive. However, note 
that this is also due to the relatively large (1000), and therefore expensive, top-N 
we computed, which is required to justify the use of the TREC benchmarks. In 
Series IV we demonstrate that much better times can be achieved in case of a 
smaller top-N. Furthermore, the relatively long queries we used (again because of 
the use of the TREC benchmarks) are also quite costly when evaluated without 
any special measures like top-N optimization, which we did not yet exploit in 
this series. 



6.2 Series II: Cut-off Moment 

Series II has been designed to develop an intuitive feel for the trade-off between 
quality and efficiency. Recall that only two fragments are used: a small fragment 
containing the ‘interesting’ terms and a much larger fragment containing mainly 
‘common’ terms. 



Series IIa: Use First Fragment Only. Figures 8, 9, and 10 plot the AP, 
the ARR, and the AET of Series IIa, respectively, together with the baseline 
performance of the unfragmented case. Figure 9 also plots the average number of 
relevant documents in the collection, averaged over all topics. The x-axis denotes 
the term count of the first fragment in ‰ (i.e. tenths of a percent) with respect 
to the total number of terms in the dictionary. 

The experiments confirm our expectations. The $AP_{fragmented}$ increases with 
increasing term count of the first fragment, moving towards the $AP_{unfragmented}$. 
This also holds for the $ARR_{fragmented}$ with respect to the $ARR_{unfragmented}$. 
The shape of the plot in Figure 10 is also not surprising: since the data distribution 
is highly skewed, the data size of the first fragment grows faster and faster with 
increasing term count of the first fragment, which explains why the AET 
increases ever faster as the term count of the first fragment increases, reaching an 
AET of just over 44 seconds when the first fragment contains all terms (100%). 
If the first fragment contains 99% of the terms, the AET is still 3.8 s while 
the ARR is 16.2 (or, average recall is 0.51) and the AP is 0.27: half of the 
documents that should have been retrieved are (on average) indeed retrieved, 
and the average precision drops only a few percentage points (to a level that 
various custom IR systems would not reach). In other words, a very reasonable 
quality can be reached in almost 20 seconds, which is more than 2 times faster 
than the time required to compute the best possible answers (given our retrieval 
model). 



Series IIb: Use Second Fragment when First One is Unable to Handle 
Query. Even the best known retrieval models do not exceed the 40% AP level. 
So, although the results shown in the previous case can be considered quite good 
compared to many other IR systems, the quality degradation still comes down 
to moving away further from that - already quite poor - upper limit of 40% AP. 
Series IIb aims to investigate a possible improvement in the quality at the cost 
of, hopefully, only a minor fall-back in efficiency. 

As is clearly demonstrated by the results shown in Figures 8 and 9, the quality 
of the results has improved significantly thanks to the switching technique. But, 
the AET has risen significantly (Figure 10). This observation particularly holds 
for the fragment size ranges below 98.5%. For larger fragments, the AET has 
stayed the same. 

This behavior can be explained by the following argument. The larger the 
size of the first fragment, the more terms are handled by the first fragment; so, 
the higher the chances that at least one of the query terms is contained in 
the first fragment. But, this also implies that the chances that a switch is needed 
drop. Conversely, the smaller the first fragment, the more often the system 
switches to a rather large second fragment, resulting in quite expensive execution. 

When increasing the number of terms in the first fragment up to approxi- 
mately 98.5%, the system switches less and less often to the second fragment; 
and, as the ‘data size’ of the first fragment is still relatively small, and the second 
(more ‘expensive’) fragment is used ever less, the total execution time drops. Up 
from 98.5%, the first fragment always contains at least one of the query terms. 
But, from that same point the data size of the first fragment starts to grow 
faster and faster: causing the AET to rise. Since from the 98.5% point up only 
the first fragment is used, the quality and performance coincide with the figures 
obtained at the previous experiments. 

As expected, the efficiency of Series IIb is lower than that of the previous 
experiment, in particular for smaller first fragment ranges. But, the AET is still 
always 2 times smaller than for the unfragmented case and the quality exceeds, 
or at least equals (from around the point of 98.5% terms in the first fragment), 
the levels reached in Series IIa, as we had hoped. The AP never drops below 
0.23, and the ARR always stays above 14. Summarizing, the switching procedure 
does improve the quality, but in more extreme cases also degrades efficiency 
considerably, caused by either switching to an expensive second fragment (sizes 
smaller than 95%) or always operating on an often too expensive first fragment (up from 
99%). 

[Figures 8 to 10: x-axis is the first fragment size (10 x %), ranging from 900 to 1000] 

Fig. 8. [Series IIa/b] AP for several relative sizes, one fragment used, preferably the 
first 

Fig. 9. [Series IIa/b] ARR for several relative sizes, one fragment used, preferably the 
first 

Fig. 10. [Series IIa/b] AET for several relative sizes, one fragment used, preferably the 
first 

Concluding Remarks. This series of experiments clearly shows the trade-off 
between speed and quality. It also demonstrates that, while retaining 
quite competitive quality, the efficiency of the retrieval process can be increased 
significantly by using this two-fragment approach. 

6.3 Series III: Benefits of Fragmenting 

In Series I we already showed the quality and performance results of the ‘no 
fancy tricks’ approach. This series focuses on the situation where we have many 
equally sized (in terms of data size) fragments (25 to be precise). 

In Table 2 we list the quality and performance results of the Series IIIa, 
IIIb, and IIIc experiments. We also included the results of Series I and the TREC 
benchmark values for comparison. 

The results for Series IIIa, where we only fragmented the database in 25 
fragments but did nothing else in particular to speed things up, clearly show 
that fragmentation by itself does not introduce any significant extra costs (an 
AET of 44.8 s versus 44.4 s for Series I). One also clearly sees that the quality 
has not decreased in any way, as we expected, since all information of any 
relevance has been taken into account. 

Series IIIb shows the results for the experiments where we used normal 
safe/unsafe top-N cut-off. As we predicted, the quality has degraded for the 
unsafe top-N technique (but only slightly) and stayed the same for the safe 
method. 

Although we did anticipate a poor performance gain for the safe method, 
the drop in performance was rather unexpected. The unsafe method was 
expected to perform better than the safe approach, but also shows 
disappointing execution times. This performance degradation of course is the 
opposite of what we intended to happen. A closer review of our log files showed 

Table 2. [Series III] Results of experiments with 25 fragments, with and without top-N 
cut-off (TREC benchmark and Series I results included for comparison) 

                               AP (%)    ARR     AET (s) 
Benchmark                      100.0     31.8    - 
Series I                        31.0     22.9    44.4 
Series IIIa                     31.0     22.9    44.8 
Series IIIb (safe top-N)        31.0     22.9    50.9 
Series IIIb (unsafe top-N)      31.0     22.7    51.0 
Series IIIc (l = 0.00)          30.0     15.1     7.9 
Series IIIc (l = 0.05)          29.8     15.6    13.5 
Series IIIc (l = 0.10)          29.7     15.9    18.7 
Series IIIc (l = 0.25)          30.0     17.6    33.0 
Series IIIc (l = 1.00)          30.1     22.9    89.1 



that the cut-off conditions were too weak, allowing a cut-off in only rare cases. 
Also the intermediate result restriction technique appeared to suffer from the 
same weak boundaries, resulting in no effective limitation of the computational 
effort. Due to the extra administrative work needed for the desired but never 
occurring cut-off, this resulted in a performance degradation instead of a 
performance gain. 

However, the reasons for the disappointing results for IIIb also explain the 
huge performance gain for the IIIc case with low $l$. For the IIIa (and IIIb) 
case the computational effort (indeed) appeared to increase for fragments with 
terms with higher df - i.e. the fragments at the end of the fragment sequence 
- due to the Zipfian nature of the data. In case of IIIc the intermediate result 
restriction did occur with almost no exception, reducing the computational effort 
per fragment to almost a constant factor. Furthermore, the AP stayed almost 
the same, while the ARR dropped a bit more. However, the recall still is about 
50% in the worst case, which is not really that bad. For the case of $l = 1.00$ 
the quality indeed equals the values measured for the IIIa case, as expected. 
However, the performance for this case is very bad, as one can expect of this 
naive approach to forcefully rank all documents. 



6.4 Series IV 

Here we describe the measured performance and quality results when we relax 
the requirements we used until now to comply with the TREC evaluation 
standards. The (a) variant shows the effects of shorter queries, whereas the (b) 
series shows what happens when a smaller top-N is delivered. 



Series IVa: Influence of Query Length. Figures 11, 12, and 13 plot the AP, 
the ARR, and the AET of Series IVa, respectively, together with the baseline 
performance of the unfragmented case. 

Fig. 11. [Series IVa] AP for several max. query lengths, no/safe/unsafe/(heuristic) 
unsafe top-N cut-off 

Fig. 13. [Series IVa] AET for several max. query lengths, no/safe/unsafe/(heuristic) 
unsafe top-N cut-off 



As expected, the smaller the query length the better the performance. And, 
although not completely compliant with the usual TREC evaluation standards, 
we also performed the query result quality evaluation, which, not surprisingly, 
shows a degradation for decreasing query length. Again, the normal safe and 
unsafe techniques do not result in a significant performance gain. The heuristic 
method, in turn, shows very good performance for the lower $l$ values. For 
$l = 0.00$ the execution time per query increases only slightly for growing query 
lengths. However, for growing $l$ the performance collapses quickly. 



Series IVb: Influence of Top-N Size. Again, we combined all the results of 
this series into 3 plots, being Figures 14, 15, and 16. 

We expected the performance to increase for decreasing top-N size. This 
indeed does happen, but it clearly only happens for really small top-N sizes, and 
then still only in a minimal form, which is less than we hoped for. Apparently 
the size of the required top-N does not really affect the computational effort that 
is required. Probably this has to do with the fact that the top-N cut-off did not 
really work as we expected and that the main performance gain is obtained from 
reducing the intermediate results/work. This latter observation is supported by 
the fact that the heuristic unsafe method always results in significantly lower 
execution times for lower values of $l$, independent of the size of the top-N. In 
general, the differences between the results obtained for the used optimization 
techniques are clearly visible and resemble the figures we already saw for the (a) 
variant. 

Fig. 14. [Series IVb] AP for several top-N sizes, no/safe/unsafe/(heuristic) unsafe 
top-N cut-off 

Fig. 15. [Series IVb] ARR for several top-N sizes, no/safe/unsafe/(heuristic) unsafe 
top-N cut-off 

Fig. 16. [Series IVb] AET for several top-N sizes, no/safe/unsafe/(heuristic) unsafe 
top-N cut-off 



Concluding Remarks. The effects we hoped to see for the (a) variant indeed 
occurred and the heuristic unsafe cut-off technique seems very promising due 
to its still relatively good quality along with very good performance for low $l$ 
values. The (b) variant also, in a sense, did show what we expected, but much less 
significantly than we hoped for. Apparently the size of the top-N is not really an 
important issue in our case. Future research has to show whether we can improve 
the top-N cut-off conditions to obtain effective top-N cut-off behavior. 
Maybe then the size of the top-N will indeed turn out to be of significance. 

Fortunately, our heuristic unsafe method does show a very interesting 
performance gain with only minor quality loss. 



7 Conclusions and Future Work 

This paper presents a convincing case for the suitability of the ‘database 
approach’ in the non-standard domain of information retrieval. We first specified 
the typical IR retrieval process declaratively. This allows the integration of IR 
techniques in our prototype DBMS, without fixing the physical execution of 
queries that use these techniques to a predetermined order, which is particularly 
important for the development of search engines for XML documents, handling 
queries that combine traditional boolean retrieval with retrieval 
by content. 

The experimental validation of our proposed techniques strongly confirms 
the expected quality versus efficiency trade-off. Series II and III establish the 
suitability of data fragmentation as an instrument to tailor the physical database 
design to match the hardware restrictions of the server machines. The final series 
of experiments provides further evidence in favor of further adapting 
our fragmentation method for top-N optimization techniques. 

Summarizing, our results demonstrate convincingly that the smart usage of 
domain knowledge can significantly improve the retrieval efficiency when oper- 
ating in a database context. Note that for short queries (i.e. only a couple of 
terms) the execution times reduce to only a few seconds per query when using 
our heuristic unsafe top-N cut-off technique. This even outperforms the initial
Google of few years ago for uncached short queries (HI. Of course Google then 
(already) operated on a data collection about 100 times bigger than the one
we used in this paper. Also, such state-of-the-art search engines make use of
(query) caching techniques closely related to database optimization techniques
like multi-query optimization, which we have not yet incorporated in our system.

Based on the results of Series II we have already devised a first prototype cost 
model that seems to predict the execution costs of our fragmented query evalua- 
tion approach very accurately (also see P). We eventually plan to use this model 
to optimize the allocation of fragments on a shared nothing parallel system. Next 
to that we plan to incorporate the cost model in the (physical) optimizer under
the logical level of our system (i.e. Moa). This will allow the optimization of the 
IR part to blend in with the rest of the optimizer, due to the transparent nature 
of our approach. To evaluate the efficiency gain and opportunities for scalability 
of this cost-based optimizer we are setting up a database with a data collection
that is 100 times larger than the one we used to perform the experiments
presented in this paper. We plan to exploit the parallel processing features of
our physical (Monet) and logical (Moa) layers to cope with this dataset using a
cluster of PCs similar to the one we described earlier in this paper. Our ultimate goal
is to demonstrate that our fragmentation approach indeed does allow seamless 
integration of multi media information retrieval technology in a DBMS in an 
efficient, scalable, and flexible manner. 

Finally we want to point out that a dedicated IR system will most likely always
outperform the best database solution but will lack its flexibility, scalability,
and general efficiency. This holds in particular when dealing with both structured
and unstructured (like text content) data. Our goal is to find a database solution
that at least shows acceptable performance for the unstructured part. We see
the results presented in this paper as a first step in the right direction.
Abstract. Integrating, cleaning and analyzing data from heterogeneous
sources is often complicated by the large amounts of data and
its physical distribution, which can result in poor query response time.
One approach to speed up the processing is to reduce the cardinality of 
results - either by querying only the first tuples or by obtaining a sample 
for further processing. In this paper we address the processing of such 
queries in a multidatabase environment. We discuss implementations of 
the query operators, strategies for their placement in a query plan and 
particularly the usage of histograms for estimating attribute value dis- 
tributions and result cardinalities in order to parameterize the operators. 

Keywords: Result Cardinality, Histograms, Multidatabase, Optimization



1 Introduction 

In a large number of application areas integration of data from heterogeneous 
sources is required, e.g., for building federated information systems or data ware- 
houses. Besides integration of data there is often also a need for cleaning and 
analyzing the data in order to obtain qualitatively appropriate results. 

By means of multidatabase languages (for instance MSQL [GLRS93],
SchemaSQL [LSS96], FraQL [SCS00]) we have the tools at hand which are
needed for querying across diverse data sources. Querying using multiple data
sources usually produces complete result sets. This requires processing large
amounts of data and thus, taking the physical distribution of the data into
account, results in very poor query response time.

In data analysis applications such as OLAP, data mining or information fu- 
sion we often want to get the ‘first’ results quickly, e.g., in order to find interesting 
regions in data or to parameterize the methods and tools. Another example is 
the case of identifying conflicting values for semantically related attributes stored
in different databases. In this case, one does not want to obtain all conflicting 
pairs of values because there might be too many of them for manual inspection. 
Instead, a certain number of conflicting pairs given as examples can help to un- 
derstand the basic problem (e.g. different scaling of values in the different data 
sources). Then, adding a corresponding conflict reconciliation function to the 
multidatabase query used for detecting this conflict should show whether there 
are no further conflicts (i.e., the reconciliation function is working for all data) 
or whether we have to modify the reconciliation function in order to capture 
more conflicts. 

Thus, if multidatabase features are combined with techniques for reducing
query response time by limiting result cardinalities, more explorative and interactive
data integration and analysis will be possible. Unfortunately, none of the
multidatabase languages proposed so far allows requests for a specified number
of resulting tuples as examples instead of the complete result set. Therefore, we
are seeking suitable extensions of multidatabase languages leading to efficient 
retrieval of example data. In this paper we will explore two ways of getting such 
example data: 

1. asking for the first n results of an integrating multidatabase query and 

2. asking for a sample containing n tuples of the complete result (or for a
   certain percentage of resulting tuples).

These two possibilities have already been considered in other contexts. For in- 
stance asking for the first n (or the best n) results is typical for information 
retrieval. Much of the work regarding this subject proposes optimization for eval- 
uating such queries (cf. e.g. icmTnnMi). In contrast to these approaches we 
have to deal with the problem that in a multidatabase environment there are 
usually legacy systems acting as local data sources. 

These local systems may have their own query processing and optimization 
engine (in particular, if they are database management systems). If such a system
offers the possibility to retrieve the first n tuples or a sample of a result, we 
obviously should try to use this possibility instead of transferring the complete 
result set for a query to the multidatabase system and computing the first n 
tuples or the sample there. 

Another aspect that we give detailed consideration to in this paper is the 
case that statistical data (histograms) about the data stored in a local source 
is available and can be accessed by the multidatabase system. In this case we 
develop a global query optimization taking this meta-data into account - focusing 
on the processing of ‘first n’ and ‘sample’ queries. 

Our work is based on the object-relational multidatabase language FraQL 
which in particular allows the dynamic addition of user-defined conflict reconcil- 
iation functions. For this language a query engine has been implemented which 
is able to access heterogeneous database systems by means of specific database 
adapters. In our current prototype environment we are using native adapters 
for Oracle and MySQL and access other data sources via ODBC. So, the main 
contribution of this paper is the application of histogram-based techniques for 
optimization and processing of ‘first n’ and ‘sample’ queries under the special
circumstances of a multidatabase system. 

The remainder of this paper is organized as follows. In the next section we
briefly present related work. Section 3 gives an overview of basic techniques for
limiting result cardinalities and sampling described in the related literature and
discusses their suitability in multidatabases. In Section 4 we describe the usage
of histograms for estimating query parameters such as intermediate result sizes
and attribute value distributions in the FraQL system, which are essential for
optimizing and processing first-n and sample-n queries. Some evaluation results
for our approach are presented in Section 5. Finally, we conclude by summarizing
the main insights and by pointing out future work.



2 Related Work 



Statistical methods have been used in central database systems for twenty years,
predominantly in the area of query optimization and query result size estimation.
Recently, in close association with data warehouse techniques, several works
have investigated how to limit query results and how to provide approximate
answers to user queries. An overview of these data reduction methods
is given in [BDF+97]. [CK97, CK98] discuss an approach to restrict the result set
by allowing the user to specify the desired result set size. This is accomplished by
the SQL extension STOP AFTER. The intermediate results are limited by placing
a stop node in the query tree. The authors propose two optimization strategies,
namely a conservative and an aggressive one. In addition they recommend a restart
node for cases in which the original stop node did not produce the desired result
size. Several commercial database management systems provide a similar
technique to compute the top-n results. 

Sampling is another technique for data reduction. The authors of 
describe different kinds of uniform random sampling techniques in a DBMS, 
because the integration of sampling in a database system can increase the per- 
formance of the sample computation. They discuss several techniques for uniform 
random sampling from base relations or the output of relational operators. 

In [CMN99] and [AGPR99] the join sampling problem is pointed out as
an example of the problem of commuting the sample operator with relational
operators. [AGPR99] uses precomputed join samples, so-called join synopses, to
provide approximate answers for join aggregation queries. These synopses are 
well suited for star or snowflake schemas which are usual in the data warehouse 
area. This approach is implemented in the AQUA system mm , which works 
on top of any commercial DBMS and stores its precomputed statistical data 
in relations within the DBMS. For providing fast approximate answers for user 
queries, the system rewrites the query using the AQUA relations instead of the 
base relations and scales the aggregated query in the desired manner. 

Using precomputed histograms for determining approximate answers is yet 
another possibility to reduce the query result size and to achieve short query re- 
sponse times. This technique is among others described in mm . The authors 
show that it is possible to execute non-aggregate and aggregate queries using
this method. Thereby the queries are executed using the histograms instead of 
the base relations. 

An interactive and iterative way to provide approximate answers for aggregated
queries, called online aggregation, is described in [HHW97]. Here the user
starts with a relatively imprecise answer provided by a first small random sample
of the data. This initial value is improved during processing. The user
can observe the value changes and the error bounds online in order to decide
when the exactness of the answer is sufficient for his needs.

3 Result Cardinality Limitation in Query Processing 

As discussed in the previous section there are several approaches to limit the 
result size of a query in order to improve the response time of query evaluation. 
In the following we will focus on two approaches which are implemented as part
of the query processor of the multidatabase system FraQL [SCS00]: LIMIT FIRST and
LIMIT SAMPLE. Both are extensions to the standard SQL SELECT statement:

SELECT <projection list>
FROM <table expression>
[ WHERE <condition> ]
[ ORDER BY <order spec> ]
LIMIT FIRST | SAMPLE <value expr> [PERCENT]

The parameter <value expr> can be any expression which evaluates to a non-negative
integer value. Thus it can be a constant, a functional expression or
a sub-query that is not correlated with the main query. If the keyword PERCENT 
is given, <value expr> denotes the percentage of the desired result size. 

3.1 First n Tuples 

With LIMIT FIRST at most <value expr> tuples are retrieved from the result 
set, if they exist. Please note that grouping and ordering have higher priority 
than cardinality reduction, so these operations are performed before any kind 
of limitation. Conversely, projection as well as aggregations have lower priority, 
i.e., the query 

SELECT avg(balance) 

FROM Accounts 
ORDER BY balance 
LIMIT FIRST 10 PERCENT 

computes the average of the top 10% of the account balances. 
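
The evaluation order described above (ordering first, then the cardinality cut, then aggregation) can be illustrated with a minimal Python sketch. The function name `limit_first` and the rounding of a percentage to a tuple count with `ceil` are assumptions of this sketch, not part of FraQL.

```python
import math

def limit_first(rows, n, percent=False, key=None):
    """Order the rows (if a key is given), then cut the ordered stream.

    Hypothetical helper: mirrors the rule that ordering has higher
    priority than the cardinality limitation.
    """
    ordered = sorted(rows, key=key) if key is not None else list(rows)
    if percent:
        # Rounding up is an assumption; the text does not fix a rounding rule.
        n = math.ceil(len(ordered) * n / 100)
    return ordered[:n]

balances = [500, 100, 900, 300, 700, 200, 1000, 400, 800, 600]

# ORDER BY balance, LIMIT FIRST 10 PERCENT, then avg(balance):
first_rows = limit_first(balances, 10, percent=True, key=lambda b: b)
average = sum(first_rows) / len(first_rows)
```

The aggregate is applied last, so it only sees the tuples that survive the limit.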

Obviously, the cardinality limitation could be performed on top of the 
database engine in the application by closing the database cursor when the limit 
is reached. However, the performance benefits would be rather low. Thus, a spe- 
cial query operator is required which can be placed in the query execution plan 
and ‘cuts’ the tuple stream after the desired cardinality. Following the operator
introduced in [CK97], we added an operator stop to our query engine, which
implements the iterator model [Gra93] and passes a given number of tuples from
the input stream. At the physical level the stop operator has several implementations:
a simple pipelined scan-stop operation for unordered limitations and a
blocking sort-stop operation that collects the top or bottom n tuples from the
input stream in a sorted heap and produces the result set after processing the
whole input.
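
The difference between the two physical variants can be sketched with Python iterators. This is only an illustration of the pipelined versus blocking behavior; the function names `scan_stop` and `sort_stop` are chosen for this sketch and do not reflect the FraQL implementation.

```python
import heapq
from itertools import islice

def scan_stop(tuples, n):
    """Pipelined stop: passes at most n tuples and then stops pulling input."""
    yield from islice(tuples, n)

def sort_stop(tuples, n, key, top=True):
    """Blocking stop: consumes the whole input and keeps the n best tuples.

    heapq maintains a bounded heap internally, so memory stays O(n).
    """
    pick = heapq.nlargest if top else heapq.nsmallest
    return pick(n, tuples, key=key)
```

Because `scan_stop` never requests more than n tuples from its input, the operators below it can terminate early, whereas `sort_stop` must see every input tuple before it can emit the first result.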

In order to minimize the costs for query execution the stop operator should
be placed low in the operator graph. In [CK97] two placement strategies are discussed.
With the conservative policy, the stop operator is inserted at a point in
the query plan where no tuple is discarded that might be part of the final result.
Let $Op_i$ be an operator of a plan $P = Op_1 Op_2 \cdots Op_{i-1} Op_i \ldots Op_r$ with $Op_r$ as
root operator and $card(Op)$ the cardinality of the result produced by $Op$, then
$Op_i$ is cardinality-preserving if the following condition holds:

$$card(Op_i) \geq card(Op_{i-1})$$

The aggressive approach tries to place the stop operator earlier in the plan,
i.e., even where it could provide a cardinality reduction. This requires estimating
the stop cardinality by using database statistics as well as a restart operator
which ensures that the desired number of tuples is produced even if the estimated
stop cardinality was too low. In this case, the sub-query below the restart
operator has to recompute the missing tuples.

The scenario which we support with our FraQL system has some special
characteristics. First, we operate in a multidatabase environment, i.e.,
parts of the query are performed by the local component databases which are
often full-fledged DBMS. Thus, we want to exploit the ability of the sources to
limit the result cardinality in order to reduce the communication costs and the
query evaluation effort at global level. Restarting a query could be very expensive,
so a safe estimation of the cardinality limit is needed. The second specialty
of our scenario is a relaxed requirement regarding the exactness of the cardinality
limitation. For supporting explorative data analysis tasks it is more important
to get results meeting a given criterion very quickly. In addition, if a percentage
limit was specified, a small discrepancy is often tolerable. Thus, we embark on
a strategy for placing the stop operator in the query plan, which is governed by
the following rules:

1. The main goal is to insert stop operators as deep as possible in the query 
execution tree according to the query’s semantics and the capabilities of the 
component databases. 

2. A safe placement is possible if the subsequent operators are cardinality- 
preserving and contain no sorting (cf. the conservative policy mentioned 
above). In this case the limit parameter need not be modified. 

3. If the query contains a sort operator which cannot be performed by the 
component databases, this operator is replaced by a sort-stop operator. 
4. If the remaining operators are not cardinality-preserving an unsafe placement
can be performed, i.e. the limit parameter has to be recomputed.

5. For unsafe placement an additional stop operator has to be inserted near the 
root of the global plan respecting the higher priority of grouping operators. 

6. For a join operation the stop operator is inserted only in one of the branches, 
either according to the safe placement policy, i.e., in the branch, for which 
the join predicate is cardinality-preserving, or - if no advantage can be taken 
from referential integrity constraints - in an unsafe manner by choosing the 
branch where it effects the largest decrease of costs. 

A plan for a query containing a LIMIT FIRST clause is constructed as follows: 
after substituting global view definitions, performing the usual transformations 
(e.g., standardization, simplification EEHl) and decomposing the query into 
sub-queries processable by the sources, the optimizer seeks to insert a stop operator
according to the rules given above at the root of the sub-query. If the remaining
global operations are not cardinality-preserving, the limit parameter has to
be adjusted by estimating the cardinality of the sub-query, which is computed
from the selectivity of the operations and the histograms of the base relations
as well as the intermediate results. Let $card(P_{all})$ be the estimated result size
without limitations and $n$ the limit specified in the LIMIT FIRST query. Then the
proper cardinality limit for the stop operator above operator $Op_i$ is as follows:

$$n_{stop} = \frac{card(Op_i) \cdot n}{card(P_{all})} \qquad (1)$$

An additional stop operator is placed at the root of the global plan just to 
ensure that not too many tuples are produced. In case of a LIMIT FIRST . . . 
PERCENT clause the unlimited result size is estimated from the histograms and 
the percentage needed for parameterizing the stop operator is computed. 
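
As a small worked illustration of the limit adjustment for an unsafe placement, the following Python sketch scales the requested limit by the ratio between the estimated sub-query cardinality and the estimated unlimited result size. The function names and the rounding up to whole tuples are assumptions of this sketch; the cardinalities in the usage lines are invented numbers, not measurements.

```python
import math

def stop_limit(card_op_i, n, card_p_all):
    """Limit for a stop operator above Op_i: card(Op_i) * n / card(P_all).

    Rounded up so that the sub-query is not cut too early (an assumption).
    """
    if card_p_all <= 0:
        return 0
    return math.ceil(card_op_i * n / card_p_all)

def percent_to_n(percent, card_p_all):
    """Translate a LIMIT FIRST ... PERCENT request into a tuple count n."""
    return math.ceil(card_p_all * percent / 100)

# Hypothetical estimates: the sub-query yields 250,000 tuples, the full plan 50,000.
n = percent_to_n(10, 50_000)
limit_for_subquery = stop_limit(250_000, n, 50_000)
```

With these numbers a request for 10 percent becomes n = 5,000 final tuples, and the stop operator in the sub-query has to let 25,000 tuples pass.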

3.2 Random Sampling 

By using the notation LIMIT SAMPLE <value expr> [PERCENT] the system gen- 
erates a simple uniform sample of size n of the query result. An efficient compu- 
tation requires a sample operator, as described in fORRh] . being applied as low 
as possible in the query plan. 

As we are in a multidatabase environment, there are several constraints
which have to be considered. Our system uses virtual integrated relations, so
there are no indexes or complete statistics available. Furthermore an efficient 
random access is not possible. But on the other hand, the different sources can 
have different features, which have to be exploited. The histogram capabilities 
of the FraQL system have to be taken into account for optimizing sampling 
queries. 

Because of missing random access to the data, sequential sampling algorithms
have to be utilized. There are two types of scenarios: known and unknown relation
sizes. In the first case, sequential sampling algorithms can be used which
have the advantage of not blocking. In the second case reservoir sampling
is necessary. These algorithms do not need the relation size, but provide
the first tuples only after the complete scan of the relation.
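
The contrast between the two scenarios can be sketched in Python. The first function uses classic selection sampling for a known relation size and the second uses basic reservoir sampling; both are textbook algorithms chosen for illustration and are not necessarily the variants used in FraQL.

```python
import random

def sequential_sample(stream, size, n, rng=random):
    """Known relation size: emits sampled tuples during the scan (non-blocking).

    Each tuple is chosen with probability (still needed) / (still unseen).
    """
    chosen = 0
    for seen, t in enumerate(stream):
        if chosen == n:
            return
        if rng.random() * (size - seen) < (n - chosen):
            chosen += 1
            yield t

def reservoir_sample(stream, n, rng=random):
    """Unknown relation size: the sample is complete only after the full scan."""
    reservoir = []
    for seen, t in enumerate(stream):
        if seen < n:
            reservoir.append(t)
        else:
            j = rng.randint(0, seen)  # inclusive bounds
            if j < n:
                reservoir[j] = t
    return reservoir
```

`sequential_sample` can hand tuples to the next operator as soon as they are selected, while `reservoir_sample` may still replace any of its entries until the input is exhausted.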

After the description of our environment and constraints, we now want to 
show which approaches can be adapted to our scenario. The crucial point is 
sampling of join and union operations. 

Several approaches to random join sampling exist in the literature. The objective
of such strategies is to push down the sampling operator on one side of the
operator tree since it is not possible to use sampling on both relations [CMN99].
Possible strategies are:

— Naive sampling first computes the complete join of R and
T and then applies the sample operator.

— The second strategy is proposed in [OR86] and includes the following steps.
Consider the computation of a join of R and T. First sample one tuple from R
uniformly at random and join it with T, obtaining the result V. Select
randomly one tuple from V and accept it with a probability proportional
to the cardinality of V. These steps are repeated until the required sample
size n is obtained.

— In [CMN99] further join sample strategies were proposed, which only require
statistics or partial statistics on one relation. Group-sample is one of these
strategies and consists of the following steps for the join of the relations R and T:

1. First produce a weighted WR-sample (with replacement) of relation R of size n. The
weight $\omega(t)$ for a tuple t is the number of tuples in T with value t.A in
the join attribute. This sample is denoted by $S_1$.

2. Join $S_1$ with T and group the join result by the tuples of $S_1$. The result is
$S_2$.

3. The last step consists of picking one tuple from each group of $S_2$
using an unweighted random sample algorithm.
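
The three steps of the group-sample strategy can be sketched in Python as follows. The sketch computes the frequencies of the join values in T exactly, which is a simplifying assumption made here for illustration; in the setting of this paper these frequencies would be estimated from statistics. The function and parameter names are hypothetical.

```python
import random
from collections import Counter

def group_sample(R, T, n, key_r, key_t, rng=random):
    """Sample n tuples of the join R x T on key_r(r) == key_t(t)."""
    # Frequency of every join value in T (exact counts: an assumption).
    freq = Counter(key_t(t) for t in T)
    weights = [freq[key_r(r)] for r in R]
    if sum(weights) == 0:
        return []  # the join is empty
    # Step 1: weighted sample with replacement from R.
    s1 = rng.choices(R, weights=weights, k=n)
    # Step 2: join the sampled tuples with T, grouped by the sampled tuple.
    partners = {}
    for t in T:
        partners.setdefault(key_t(t), []).append(t)
    # Step 3: pick one join partner uniformly from each group.
    return [(r, rng.choice(partners[key_r(r)])) for r in s1]
```

Tuples of R without a join partner get weight zero and are never drawn, so every group in step 2 is non-empty.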

With the above constraints, the sampling approach according to [OR86] cannot
be applied in our environment because it requires an index as well as full
statistics. However, the naive sampling algorithm or group-sample is possible.
To support the latter strategy the FraQL system supports a histogram scheme,
the application of which is described in Section 4.2.

There are different approaches to obtain samples of the union operation. 
These techniques require indexes and statistics and hence a materialization.
So they cannot be used in our scenario. 



4 Using Histograms 

In relational database systems, information about cardinalities of the relations 
and the distributions of attribute values is essential for calculating the costs 
of query execution plans. Thus, modern systems maintain statistical informa- 
tion mostly in the form of histograms which are particularly well suited to the 
representation of approximations of non-parametric distributions. Using this in- 
formation the query optimizer is capable of estimating selectivities of operators 



Limiting Result Cardinalities for Multidatabase Queries Using Histograms 159 



and cardinalities of result sets. A histogram consists of a set of buckets, each
representing a subset of the values of an attribute. Each bucket contains the number of
tuples having a value associated with the bucket. In the following we denote
the number of buckets as $B$ and the number of tuples in the $i$-th bucket as
$bucket(i)$ for $i = 1 \ldots B$.

There are various classes of histograms used in database systems nowadays.
In equi-width histograms the size of the range of values in each bucket is the
same, whereas in equi-height histograms the total frequency of the attribute values
contained in each bucket is the same. Typically, equi-height histograms are used in
database systems because of the lower worst case error [PSC84]. In [HP95] serial
and end-biased histograms are proposed as optimal solution in many cases, but
they are currently not very common. However, using histograms in a heterogeneous
database environment entails several difficulties:

— the representation of histograms in system catalogs differs for the various 
available database systems, 

— the availability and efficient access to histograms are crucial factors for pro- 
cessing global queries, 

— there are data sources which do not maintain histograms or for which his- 
tograms cannot be calculated due to limited query capabilities. 
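
To make the difference between the two histogram classes mentioned above concrete, the following Python sketch builds both for a small skewed value list. The bucket representation (plain counts for equi-width, `(low, high, count)` triples for equi-height) is an assumption of this sketch and not the catalog format of any particular system.

```python
def equi_width(values, B):
    """B buckets covering value ranges of equal size; returns the counts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / B or 1  # avoid a zero width if all values are equal
    counts = [0] * B
    for v in values:
        j = min(int((v - lo) / width), B - 1)  # the maximum falls into the last bucket
        counts[j] += 1
    return counts

def equi_height(values, B):
    """Up to B buckets holding (nearly) the same number of tuples each."""
    s = sorted(values)
    n = len(s)
    buckets = []
    for j in range(B):
        start, end = j * n // B, (j + 1) * n // B
        if start < end:
            buckets.append((s[start], s[end - 1], end - start))
    return buckets

values = [1, 2, 3, 4, 100, 200, 300, 1000]
width_counts = equi_width(values, 4)
height_buckets = equi_height(values, 4)
```

For skewed data the equi-width histogram puts most tuples into a single bucket, while the equi-height histogram adapts its bucket boundaries to the data.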

Thus, for our FraQL system we have chosen an approach for histogram maintenance
where histograms of globally visible relations, i.e., the integrated relations,
are kept in the global layer of the multidatabase. When a relation is ‘imported’
from a source, i.e., a global virtual view is defined in FraQL on this relation,
the histograms for this relation are retrieved. The adapters for the individual
database systems participating in the multidatabase are responsible for a
uniform access to the system-specific histograms. So, each adapter provides a
method for obtaining a histogram which can be implemented in one of the following
ways:

— get the histogram directly from the database catalog of the source, 

— trigger the computation of the histogram in the source (e.g., ‘compute statis- 
tics’ in Oracle), 

— compute the histogram in the adapter itself, 

— construct a trivial histogram representing an equipartition if neither a his- 
togram is available nor can it be computed. 

Obviously, caching histograms in the federation layer is a compromise between
efficient access and availability on the one hand and the currency of the statistics on the other.
However, for our intended application scenario - data analysis in heterogeneous 
databases - this approach seems to be practical. 

In the following subsections we describe the usage of histograms for supporting
the result cardinality limitation techniques introduced in Section 3. In particular we
discuss the calculation of estimations for intermediate result cardinalities and 
distributions as implemented in the FraQL query processor. 
4.1 Using Histograms for Estimating Stop Cardinalities

In Section 3.1 we have identified cardinality estimation as an important task for
parameterizing the stop operator, if only an unsafe placement is possible. For this
purpose histograms are very helpful. Based on histograms of the base relations,
the distribution and cardinality of the intermediate results after applying the
particular operators of the query plan can be estimated by the global query
optimizer. Finally, the limit parameter for the stop operator can be derived
according to equation (1). This approach is implemented in the FraQL system by
means of the following three steps:

1. Traversing the operator tree top-down, all attributes are determined for 
which histograms are needed, i.e., attributes referenced in expressions of 
join or selection conditions for example. 

2. For each operator node the attribute value distribution in form of the his- 
togram, the cardinality of the result set as well as the selectivity of the oper- 
ator are calculated. This is performed bottom-up for all attributes identified 
in Step 1. 

3. The limit parameter for the stop operator is calculated using equation (1).

Fig. 1 illustrates these steps for the operator tree of the query:

SELECT *
FROM customers c, insurances i, accounts a
WHERE c.cust_id = i.cust_id AND
c.cust_id = a.account_no AND
a.balance > 6000
LIMIT FIRST 10 PERCENT;



Assuming that coinciding histograms are available for the attributes c.cust_id
and i.cust_id (denoted as $hist_1(cust\_id)$ and $hist_2(cust\_id)$) of the base relations,
the buckets of the histogram $hist_3(cust\_id)$ for the join $c \bowtie i$ can be calculated
using the following formula [SS94]:

$$\forall j = 1 \ldots B : \quad bucket_{c \bowtie i}(j) = \frac{bucket_c(j) \cdot bucket_i(j)}{\max(d_c, d_i)}$$

Here, $d_c$ and $d_i$ are the numbers of distinct values present in the join column from
c or i respectively. If the histograms do not coincide a preceding normalization
step has to be performed.
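
The bucket-wise join estimate can be written down directly. In the following Python sketch the two histograms are lists of bucket heights with identical bucket boundaries, and the distinct-value counts are interpreted as counts per bucket; that interpretation, like the function name, is an assumption of this sketch.

```python
def join_histogram(buckets_c, buckets_i, d_c, d_i):
    """Estimate the histogram of c join i from two coinciding histograms.

    Each result bucket is bucket_c(j) * bucket_i(j) / max(d_c, d_i).
    """
    if len(buckets_c) != len(buckets_i):
        raise ValueError("histograms must coincide (same buckets)")
    d = max(d_c, d_i)
    return [bc * bi / d for bc, bi in zip(buckets_c, buckets_i)]

# Hypothetical example: cust_id is unique in customers (100 tuples and 100
# distinct values per bucket); insurances has 80 distinct values per bucket.
hist_join = join_histogram([100, 100], [100, 200], d_c=100, d_i=80)
```

In the first bucket 100 customers meet 100 insurance tuples, giving an estimate of 100 join tuples; in the second bucket the estimate doubles with the number of insurance tuples.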

For the selection operator on relation accounts the histogram can only be
derived indirectly by calculating the selectivity of the operator. Using
histogram-based formulas we can estimate the selectivity sel for the
expression balance > 6000. Let $sel_{\theta c}$ be the estimated selectivity of the comparison
$attr\ \theta\ c$ for $\theta \in \{<, >, \leq, \geq, =\}$. Since

$$sel_{>c} = 1 - sel_{\leq c} \quad \text{and} \quad sel_{\leq c} = sel_{<c} + sel_{=c}$$

we have only to estimate $sel_{<c}$ and $sel_{=c}$. In the following, formulas are
given for the case where the value $c$ is in the $k$-th bucket:

$$sel_{<c} = \frac{k - 1 + 1/3}{B + 1}, \qquad sel_{=c} = \frac{1/3}{B + 1}$$

Fig. 1. Histogram estimation for a LIMIT FIRST-query
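
The selectivity estimate for a comparison such as balance > 6000 can be sketched in Python as follows. The two base formulas are taken as they can be read from the text above, and the helper `bucket_of`, which locates the bucket k of a constant from the upper bucket boundaries, is a hypothetical addition for this sketch.

```python
import bisect

def bucket_of(upper_bounds, c):
    """1-based index k of the bucket containing c (hypothetical helper).

    upper_bounds holds the sorted upper boundary of each of the B buckets.
    """
    return bisect.bisect_left(upper_bounds, c) + 1

def sel_less(k, B):
    """Estimated selectivity of attr < c when c lies in the k-th bucket."""
    return (k - 1 + 1 / 3) / (B + 1)

def sel_equal(B):
    """Estimated selectivity of attr = c."""
    return (1 / 3) / (B + 1)

def sel_greater(k, B):
    """attr > c is the complement of attr <= c, i.e. of (attr < c) or (attr = c)."""
    return 1 - (sel_less(k, B) + sel_equal(B))

# Hypothetical equi-height histogram on balance with B = 10 buckets.
uppers = [2000, 4000, 6000, 8000, 9000, 10000, 11000, 12000, 14000, 30000]
k = bucket_of(uppers, 6000)
sel = sel_greater(k, len(uppers))
```

Only the bucket position of the constant enters the estimate, which is why the quality of the result depends on how well the bucket boundaries follow the data.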



Based on these formulas the selectivity of the expression can be computed using
the histogram of a.balance. This value is used to adjust the histogram for
a.account_no (denoted as $hist_6(account\_no)$) by reducing the height of each
bucket assuming independence of the attributes:

$$\forall j = 1 \ldots B : \quad bucket(j) \leftarrow bucket(j) \cdot sel_{>6000}$$

Next, the histogram $hist_7(cust\_id)$ for $(c \bowtie i) \bowtie a$ is calculated as shown above,
where the stop operator is ignored for the moment. The cardinality of the final
result set at the root of the operator tree is equal to the sum of the heights of
all buckets of this histogram:

$$card = \sum_{j=1}^{B} bucket(j)$$
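
Both operations, scaling every bucket by a selectivity and summing the bucket heights, are one-liners. The following Python sketch shows them on invented numbers; the function names and the selectivity value 0.75 are assumptions used only for illustration.

```python
def apply_selection(buckets, selectivity):
    """Reduce the height of each bucket, assuming attribute independence."""
    return [b * selectivity for b in buckets]

def cardinality(buckets):
    """Estimated result size: the sum of all bucket heights."""
    return sum(buckets)

# Hypothetical equi-height histogram on account_no: 10 buckets of 32,500 tuples.
hist_account_no = [32_500] * 10
hist_after_selection = apply_selection(hist_account_no, 0.75)
card_after_selection = cardinality(hist_after_selection)
```

The independence assumption is what allows a selectivity derived from the balance histogram to be applied uniformly to the account_no histogram.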
Finally, the limit parameter for the stop operator is calculated. This first requires
estimating the requested percentage of the cardinality of the whole relation. This
value then can be inserted as parameter n in equation (1).

A special treatment is required for histograms of relations containing at- 
tributes which are transformed by applying so-called mapping functions 
as part of the view definition. Because the mapping is implemented as a spe- 
cial query operator in the query plan the involved histogram also has to be 
mapped. A straightforward solution is to apply the mapping function to each 
bucket boundary value. 
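
A minimal Python sketch of this straightforward solution is given below. It assumes that a bucket is stored as a `(low, high, count)` triple and that the mapping function is monotonically increasing, so that the mapped boundaries still delimit the same tuples; both points are assumptions of this sketch.

```python
def map_histogram(buckets, mapping):
    """Apply a mapping function to every bucket boundary, keeping the counts.

    Assumes `mapping` is monotonically increasing; otherwise the buckets
    would have to be re-sorted or merged.
    """
    return [(mapping(lo), mapping(hi), count) for lo, hi, count in buckets]

# Hypothetical mapping: a source stores values at half the global scale.
mapped = map_histogram([(0, 100, 5), (100, 200, 7)], lambda v: v * 2)
```

The bucket heights stay unchanged because the mapping only renames the values and does not add or remove tuples.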

4.2 Using Histograms for Sampling 

In this section the use of histograms to support the sample operation is discussed. 
As shown in Section 3.2 there is a need to apply a weighted sampling algorithm to
overcome data skew and problems with join sampling. Calculating these weights 
requires knowledge regarding frequencies of the distinct values which can be 
provided by histograms. The following example query computes a random sample 
of size 1,000 of the join between the relations customers and insurances. 

SELECT * 

FROM customers c, insurances i 
WHERE c.cust_id = i.cust_id 
LIMIT SAMPLE 1000 

In order to improve the performance the sample operator has to be moved
towards the leaves in the operator tree. One strategy, the group-sample mentioned
above, is introduced in [CMN99]. Figure 2 shows how we support this
strategy with histograms.




Fig. 2. Histogram estimation for a LIMIT SAMPLE-query 



We want to sample the join of customers and insurances on the attribute
cust_id. According to the group-sample strategy we therefore need frequency
information about the distinct values of cust_id in relation insurances as provided
by a histogram. A weight $\omega(t)$ of a tuple t of the relation c is calculated
as follows:

$$\omega(t) = card(P_{all}) \cdot sel_{=t.cust\_id}(i)$$

The expression $sel_{=t.cust\_id}(i)$ denotes the selectivity of the value in relation
insurances and is computed from the statistical data in the histogram for the
column insurances.cust_id. So the first step of the group-sample strategy is
accomplished. The second step is performed in the group-sample-join operator,
whose output is a sample of the join of the relations customers and insurances.
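
How such a weight could be derived from a histogram is sketched below in Python. The histogram layout (upper bucket boundaries, bucket heights and distinct-value counts per bucket) and the uniformity assumption inside a bucket are choices made for this sketch; the cardinality passed as `card` stands for the cardinality factor of the weight formula and is set to the total tuple count of the histogram in the example, which is also an assumption.

```python
import bisect

def selectivity_eq(value, uppers, counts, distinct):
    """Estimated fraction of tuples whose attribute equals `value`.

    Assumes values are spread uniformly over the distinct values of a bucket.
    """
    total = sum(counts)
    j = bisect.bisect_left(uppers, value)
    if total == 0 or j >= len(counts) or distinct[j] == 0:
        return 0.0
    return counts[j] / distinct[j] / total

def weight(value, card, uppers, counts, distinct):
    """Weight of a tuple: cardinality times the selectivity of its join value."""
    return card * selectivity_eq(value, uppers, counts, distinct)

# Hypothetical histogram on insurances.cust_id with two buckets.
uppers, counts, distinct = [100, 200], [300, 100], [100, 50]
w_low = weight(42, 400, uppers, counts, distinct)
w_high = weight(150, 400, uppers, counts, distinct)
```

A customer whose cust_id falls into the first bucket is expected to match about three insurance tuples and is therefore drawn more often than one from the second bucket.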

5 Evaluation 

The main focus of the following empirical evaluation is not on the possible
performance gains, because these depend strongly on the characteristics
of the involved data sources. Instead, we evaluate the quality of estimations
and results, which rely on statistical information contained in histograms
as previously described. For this purpose the following schema is used:

db#1: customers (cust_id, income)
      insurances (insurance_id, cust_id)
db#2: accounts (account_id, account_no, balance)

Database db#l consists of two relations customers and insurcinces. It stores 
information about 250,000 customers, each one having one, two or no insurances 
with the average of one. No special distribution describes the attribute cust_id, 
but there is a foreign key constraint from insurcuices to customers. The sec- 
ond database db#2 to be integrated to a global view consists of a table of about 
325,000 accounts. The attribute accountjio matches to customers . cust_id 
and for about 57% of all customers at least one account can be found. The bal- 
ances are normally distributed with a mean of 10,000 and a standard deviation of 
5,000. For all involved attributes equi-height histograms consisting of 10 buckets 
are calculated. 
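
For readers unfamiliar with equi-height histograms, the following Python sketch shows one simple way to build them. It is a simplified illustration under our own assumptions (sorting the full column, cutting it into equally sized runs), not the construction used in the evaluation; in particular, equal values may straddle a bucket boundary here.

```python
def equi_height_histogram(values, buckets=10):
    """Split the sorted values into `buckets` runs of (nearly) equal size.

    Returns one (low, high, count, distinct) tuple per non-empty bucket.
    """
    data = sorted(values)
    n = len(data)
    histogram = []
    for b in range(buckets):
        start = b * n // buckets
        end = (b + 1) * n // buckets
        chunk = data[start:end]
        if chunk:
            histogram.append((chunk[0], chunk[-1], len(chunk), len(set(chunk))))
    return histogram
```

Because every bucket holds roughly the same number of tuples, skewed regions of the value domain get narrow buckets and thus finer resolution.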

The example query executed on this configuration leads to the access plan
shown in Table 1. Here, x in step 7 stands for the requested cardinality in
percent. Because in the average case there exists one insurance per customer,
the cardinality of the index scan in step 2 is approximately the same as the
number of tuples in customers. The limit Lsub for the subplan depends on the
percentage of data sets to be retrieved and is calculated with the support of
the histogram. To verify whether this calculation step leads to a correct
limitation, the desired and the actually generated result sizes for the
estimated Lsub, computed as described earlier, are compared in Fig. 3 a). The
number of tuples of the exactly calculated percentage is compared with the
number of tuples retrieved using the estimation techniques in Fig. 3 b). It
can be seen that the histogram-supported estimation leads to a result size
which is close to the requested cardinality. The difference increases with
higher limit values. For the considered query, the number of retrieved tuples
is underestimated in all cases, so that at least the desired cardinality is
provided.

Table 1. Access plan: verification of cardinality estimation

Step (i)  Operator (Op_i)                      Cardinality (card(Op_i))
7         Select                               ≈ L = card(P_all) · x/100
6         Join results from Step 4 and 5       see text
5         Table access [accounts]              card(accounts)
4         Stop Lsub                            ≤ Lsub
3         Join [customers] and [insurances]    ≈ card(customers)
          using index on [insurances]
2         Unique index range scan              ≈ card(customers)
          [Primary key of insurances]          ≈ card(insurances)
1         Table access [customers]             card(customers)




Fig. 3. Evaluation results: (a) comparison between estimated and retrieved
size; (b) relative estimation error

The quality of the sample operation is verified by testing whether the existing
data distributions are maintained. For this test a sample of 1,000 tuples
of the joined relations customers and insurances is generated. The access plan
is shown in Table 2. The methodology of first generating a weighted sample
and then applying the modified join operation to the result is described in
detail in [CMN99] as the stream-sample strategy.

Table 2. Access plan: verification of maintaining of distribution

Step (i)  Operator (Op_i)
5         Select
4         Join
3         Weighted sample with replacement of [customers]
          Weights are frequencies from Step 2
2         Histogram access (frequency of cust_id [insurances])
1         Table access [customers]

In the base relation [customers] the attribute income is approximately
normally distributed with a mean of 5,000 and a standard deviation of 1,000.
Using a goodness-of-fit statistic we test the hypothesis that this
distribution is maintained by the join operation on a sample of $n_1 = 100$
and $n_2 = 1000$ tuples. Let the level of significance be $\alpha = 0.05$ and
the test intervals $A_1 = (-\infty; 1000]$, $A_2 = (1000; 2000]$, $\dots$,
$A_{10} = (9000; \infty)$. Further, let $freq_j$ be the observed frequency of
a value of the sample in interval j and $p_j$ the theoretical expected count.
Then, the value of the test function



$$V = \frac{1}{n} \sum_{j} \frac{freq_j^2}{p_j} \qquad (2)$$



evaluates to 8.69 for the sample size $n_1$ and to 6.31 for the sample size
$n_2$ of 1000 tuples. Because the $\chi^2_{1-\alpha}$ fractile of the
distribution $\chi^2(k-1) = \chi^2(9)$, which is 16.92, is larger than these
values, we cannot reject the hypothesis that the original distribution is
maintained. Consequently, sampling constitutes an adequate way to reduce the
data for analysis purposes.
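
The test procedure can be sketched in Python as follows. This is a hedged illustration, not the code used for the evaluation: it uses the standard Pearson form of the statistic with interval probabilities derived from the assumed normal distribution, which may differ in notation from the test function given above, and all function names are our own. The critical value 16.92 is the one quoted in the text.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def interval_probabilities(upper_bounds, mean, std):
    """Probability mass of (-inf, b1], (b1, b2], ..., (b_last, +inf)."""
    cdf = [normal_cdf(b, mean, std) for b in upper_bounds]
    edges = [0.0] + cdf + [1.0]
    return [hi - lo for lo, hi in zip(edges, edges[1:])]


def pearson_statistic(observed, probabilities):
    """Sum over intervals of (observed - expected)^2 / expected."""
    n = sum(observed)
    return sum((f - n * p) ** 2 / (n * p) for f, p in zip(observed, probabilities))


CRITICAL_VALUE = 16.92  # chi-square fractile for 9 degrees of freedom, alpha = 0.05


def distribution_maintained(observed, probabilities, critical=CRITICAL_VALUE):
    """True if the hypothesis of an unchanged distribution cannot be rejected."""
    return pearson_statistic(observed, probabilities) < critical
```

With the ten intervals above, `interval_probabilities([1000 * i for i in range(1, 10)], 5000, 1000)` yields the theoretical interval probabilities for the income attribute.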
Abstract. Query throughput is one of the primary optimization goals
in interactive web-based information systems in order to achieve the 
performance necessary to serve large user communities. Queries in this 
application domain differ significantly from those in traditional database 
applications: they are of lower complexity and almost exclusively read- 
only. The architecture we propose here is specifically tailored to take 
advantage of the query characteristics. It is based on a large parallel 
shared-nothing database cluster where each node runs a separate server 
with a fully replicated copy of the database. A query is assigned and 
entirely executed on one single node avoiding network contention or syn- 
chronization effects. However, the actual key to enhanced throughput 
is a resource efficient scheduling of the arriving queries. We develop a 
simple and robust scheduling scheme that takes the currently memory 
resident data at each server into account and trades off memory re-use 
and execution time, reordering queries as necessary. 

Our experimental evaluation demonstrates the effectiveness when scaling 
the system beyond hundreds of nodes showing super-linear speedup. 



1 Introduction 

A significant number of web-based information systems rely on database
technology to serve large user communities, which makes scalability a key
issue for the design of web-enabled database systems. Parallel processing and
data replication are necessary to deal with the peak loads encountered.
Likewise, an effective query dispatching scheme is needed to level the system
load as well as to guarantee quality of service in terms of response time.

In this paper we report initial experiences with a multi-media portal under
construction, based on the Monet database system [BK99]. The system is
intended to provide efficient access to a large collection of indexed
multi-media objects. It is characteristic of this kind of information system
that user interaction is dominated by read accesses. A number of systems with
similar requirements regarding the deployed database backend have been
developed and many more are currently under construction.

With each user interaction, the interface emits a number of queries to the
database that ideally lead to an answer set of a few tens of candidate results.
Involving accesses to different multi-dimensional indexes, the evaluation of
such queries usually takes on the order of a few seconds. Still, the queries
are of distinctly low complexity compared to queries in classical database
applications. Moreover, the deviation of running time among the queries is
limited, not least to ensure acceptable response times.

B. Read (Ed.): BNCOD 2001, LNCS 2097, 2001.

The primary challenge in this setting is to develop processing techniques
that optimize query throughput. Parallel processing is an essential element
in achieving this. However, a straightforward recasting of methods developed
for parallel databases does not apply here, since most solutions devised in
this area are geared almost exclusively towards highly complex and
long-running queries. There, queries are usually parallelized at the
granularity of partial plans or even single operators, i.e. single operators
like the join of two tables are executed in parallel on different nodes, which
involves the exchange of partial results among the nodes. These techniques are
ineffective for the kind of query we are considering, since communication and
coordination overheads would outweigh the actual benefits.

In this paper, we propose a parallel query processing architecture that can 
take advantage of the query characteristic by its physical design, suitable query 
scheduling, and the way queries are executed. 

The platform of operation is a shared-nothing environment, i.e. a cluster of
inexpensive PCs, where each node runs a Monet server with a fully replicated
copy of the database. One machine is distinguished as the coordinator node,
which dispatches the arriving queries to the servers according to a scheduling
strategy. The scheduling scheme we develop in this paper differs radically
from previous work: we do not try to model various system parameters in order
to exploit primarily idle system resources, but take into account what data is
memory-resident at the servers, i.e. cached by the servers. The algorithm is
based on a metric that determines the distance between a server and a query:
the smaller this distance, the more similar the state of the memory at this
server is to what is required to process the query. Moreover, we investigate
possibilities of re-ordering and deferred execution of queries to further
reduce execution costs. Once a query is assigned to a server it is executed in
isolation on this server, so no synchronization or communication within the
cluster is needed, which frees interconnection bandwidth for the shipping of
both queries and results.

Since we are dealing with read-only accesses, we do not have to consider trans- 
action mechanisms to keep the replicated data consistent across the database 
cluster. Rather, the databases are periodically updated by mirroring a master 
database. 

The experimental evaluation of the proposed techniques shows substantial
savings over conventional greedy scheduling that takes only the machines'
workload into account. In a large number of experiments we closely investigate
the impact of individual parameters. Our results confirm the architectural
decisions and show excellent scaling behavior.

The remainder of this paper is organized as follows. We review related work
in Section 2. In Section 3 we present the architecture, and we describe the
query model in Section 4. Section 5 discusses the modeling of the server pool
and the scheduling algorithm. In Section 6 we present a comprehensive
performance analysis. Section 7 contains a discussion of the design decisions.
We conclude the paper with Section 8.



2 Related Work 



Parallel query processing has been studied in a large variety of facets. Most
of the related work in this field concentrates on possibilities to speed up
highly complex queries with long running times. Several approaches suggest a
decomposition of the query plans into sub-plans which are then executed in
parallel on different nodes of the parallel processing environment. The
granularity of this decomposition varies: it can be as fine as parallelizing
single operators, but is often chosen coarser.



These approaches have in common that they require communication between
single nodes for shipping or exchanging partial results. This causes network
contention and synchronization effects where nodes have to wait for others to
complete their tasks first. As a result, a parallelization along these lines
only pays off if the query is of sufficiently high complexity. Otherwise,
communication overhead and synchronization effects outweigh the performance
gains.

Moreover, parallel processing as outlined above scales effectively only for
small numbers of nodes. In shared-nothing architectures, network contention
increasingly becomes a bottleneck; in the case of shared-everything, the high
degree of resource sharing limits the scaling [Sto86, NZT96].

What makes most of these approaches questionable, however, is the fact
that during optimization a number of highly sensitive parameters, which
are hard to model mathematically and impossible to maintain accurately at
run time, have to be taken into account. Parameters like network load, page
faults etc. are assumed constant during query execution, and estimation
errors, which may have a severe impact on performance, cannot be corrected.



General memory allocation issues have been explored extensively with respect
to various aspects of query processing. In [MD93], the authors propose dynamic
memory allocation schemes for multi-query workloads to level memory
allocation without sharing resident data among different queries. When
queries and their common sub-expressions are analyzed, the re-use of
memory-resident data is often a side effect [SSN94]. In [MSD93], the authors
consider batch scheduling for parallel processing. However, main memory is in
both cases viewed transparently as a central resource, and data location
within distributed memory has not been considered.

In the context of transaction processing, several query routing schemes for
database clusters have been investigated. This field of application differs
from the problem at hand in its preliminaries: queries are usually of a
complex nature and updates to the database need to be propagated over the
complete cluster. One of the major goals is, for instance, to schedule
queries so that locking conflicts are avoided.



3 Architecture 



Figure 1 shows the architecture of our system. It consists of a cluster of
database servers managed by a central scheduler node. All machines have
separate main memory, CPU, and disks, and share no resources other than
network bandwidth. We use Monet, the main-memory database system developed at
CWI, as database server [BK95, BK99]. Besides its vertically fragmented data
model, Monet is distinguished by its memory awareness, i.e. it solely uses
operating system primitives for its memory management, mapping database tables
directly to virtual memory, to avoid the overhead of a proprietary buffer
manager simulating a virtual memory layer. Moreover, it provides specifically
cache-aware operator implementations [BMK99] to maximize system utilization on
running queries.

Additionally, a pool of web-servers forms the front-end of the system which 
clients interact with. The web-servers receive either parameterized queries using 
text-based forms, or they interact with visual query formulation tools. 

Processing of a client query is done in 7 steps. Users formulate their query
using the web interface (Fig. 1 (1)). At the web server, the query is
re-formulated using the internal procedural query representation of the
database system (in our case MIL, the query language of Monet) and submitted
to the query scheduler (2). The scheduler maintains a queue of queries that
are to be executed. By analyzing the data requirements for the execution of a
query and the data resident in main memory at the servers, the scheduler
determines a favorable assignment of queries to servers (4). The query is
executed on the assigned node (5) and the result is returned to the front-end
host (6).¹ After formatting, the result is shipped to the user (7).

This architecture directly aims at the throughput optimization of compact
queries, for which execution on a single node is the most efficient kind of
processing. It also exploits the second query characteristic: the
impreciseness of the data. For example, data gathered by robots from the
Internet to build up a multi-media index is updated only at low frequency.
Thus, the servers of the cluster do not have to be kept in sync as would be
the case if updates by users were allowed. Instead, the databases are updated
periodically by replicating the master database (see the periodical update in
Fig. 1). The frequency of updates naturally depends on the application domain.



¹ For simplicity of presentation, arrows (6) follow the routing of the query
(2-4); however, the query result can be sent directly to the front-end and
does not have to pass the query scheduler.

Fig. 1. Query processing architecture (legend: query scheduling; periodical
update of replicated databases)



4 Query Model 

The choice of Monet as database software for the back-end cluster implies a
specific model for the query execution. The vertical fragmentation of the
tables in Monet causes bulk processing to be more efficient than pipelining
techniques. As a consequence, no more than two tables are processed on a
single CPU at a time. Note that this execution model does not impose any
restrictions on the shape of the execution plans: both linear and bushy plans
are feasible. Access to a table can be either of type LOAD or SCAN. LOAD reads
a table from disk into memory to make it part of the hot set, e.g. used with
inner relations of nested-loop joins. SCAN reads the table but does not keep
the data in memory after the actual operation is performed, e.g. used in
selections or for the probe relation of a hash join. The building of a hash
table can be seen as a special type of LOAD, as the result is only accessible
via the hash attribute.

The cost of executing a relational algebra operator consists of the cost of
loading/scanning the table plus the actual cost of the operation. Fitting the
pieces together, we can describe a query as a sequence

$$Q = ((t_1, a_1, l_1, w_1), \dots, (t_n, a_n, l_n, w_n))$$

where each quadruple corresponds to an operator or, as in the case of a join,
to a partial operator. $t_i$ specifies the table necessary for the operation,
$a_i$ denotes the hash attribute, i.e. the operator accesses $t_i$ via this
attribute; $a_i = \epsilon$ if not applicable.

The associated loading costs, if the table is not already resident in main
memory, are given by $l_i$. We denote the total loading costs of a query as

$$L(Q) = \sum_i l_i.$$

The costs to execute the operator, the operator costs, are denoted by $w_i$.
We denote the total operator costs of a query as

$$W(Q) = \sum_i w_i.$$

Note that both $l_i$ and $w_i$ are expressed in terms of the same unit to
allow a proper comparison.

A query's execution time on a "cold" server, i.e. one on which no data is
loaded yet, is

$$x(Q) = L(Q) + W(Q).$$

If all tables needed by Q are already in memory and, in the case of hash
tables, are hashed by the proper attribute, the execution costs of the query
amount to $W(Q)$ only.

Examples

Here are some examples to illustrate this modeling based on Monet performance
characteristics. The two tables A and B used in these examples are of size
8.8MB and 4.95MB, respectively. Our test platform achieved a bandwidth of
5.5MB/s for disk access.

- Nested-Loop Join, $A \bowtie B$

  $Q = ((A, \epsilon, 1.6, 0), (B, \epsilon, 0.9, 2.6))$

  Table A takes 1.6s to load; no operator costs occur. Table B takes 0.9s to
  load; performing the join requires 2.6s.

- Hash Join, $A \bowtie B$

  $Q = ((A, a, 2.7, 0), (B, \epsilon, 0.9, 1.1))$

  Similar to the previous example, but now A must be a hash table with hash
  attribute a. Hence loading A is more expensive, as it includes building the
  hash table.
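
The query model and the two examples can be written down as a short Python sketch. The class and function names are our own choices for illustration; `None` stands for the empty hash attribute, and the cost values are those of the examples above.

```python
from typing import NamedTuple, Optional


class Op(NamedTuple):
    """One quadruple of the query model."""
    table: str
    hash_attr: Optional[str]  # None if no hash attribute applies
    load_cost: float          # seconds to bring the table into memory
    op_cost: float            # seconds for the actual operation


def loading_cost(query):
    """Total loading costs L(Q)."""
    return sum(op.load_cost for op in query)


def operator_cost(query):
    """Total operator costs W(Q)."""
    return sum(op.op_cost for op in query)


def cold_time(query):
    """Execution time on a cold server: L(Q) + W(Q)."""
    return loading_cost(query) + operator_cost(query)


def warm_time(query):
    """Execution time if all required tables are already resident: W(Q)."""
    return operator_cost(query)


nested_loop = [Op("A", None, 1.6, 0.0), Op("B", None, 0.9, 2.6)]
hash_join = [Op("A", "a", 2.7, 0.0), Op("B", None, 0.9, 1.1)]
```

For the nested-loop join this gives a cold execution time of 1.6 + 0.9 + 2.6 seconds, whereas on a server that already holds both tables only the 2.6 seconds of operator cost remain.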
5 Query Scheduling

The query scheduling comprises several elements. Besides a model for the
servers, we define a server-query distance which captures the potential re-use
of memory-resident data. Additionally, we introduce deadlines.

5.1 Servers 

For the scheduling, a server of the database cluster is modeled by its state of 
memory and the workload. The state of memory is the set of tables resident 
together with a replacement strategy. We are considering base tables only and 
discard intermediate results of the processing as soon as they are no longer used. 
As a replacement strategy we use LRU as it exhibits the best average perfor- 
mance. The loading and dropping of tables is done via the memory mapping 
functionality of the operating system. To maintain sufficient control over the 
memory allocation throughout the complete cluster we load and drop only com- 
plete tables. This way, the scheduler can rely on the information which tables are 
memory-resident, i.e. accessing them will not cause additional costs for swap- 
ping. Swapping may only occur when all unused tables are already dropped but 
the memory requirements of the current operation are still not met. 
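
The memory model described above can be sketched as a small LRU cache of whole tables. This is an illustrative Python sketch under our own assumptions (sizes in MB, tables keyed by name and hash attribute, class and method names invented here), not the simulator's code. A table larger than the capacity is still admitted after everything else has been dropped, which corresponds to the swapping case mentioned above.

```python
from collections import OrderedDict


class TableCache:
    """Memory state of one server: whole tables, dropped in LRU order."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        # (table, hash_attr) -> size in MB, least recently used entry first
        self.tables = OrderedDict()

    def is_resident(self, table, hash_attr=None):
        return (table, hash_attr) in self.tables

    def access(self, table, size_mb, hash_attr=None):
        """Touch a table; load it (dropping LRU tables) if needed.

        Returns True if the table had to be loaded, False if it was resident.
        """
        key = (table, hash_attr)
        if key in self.tables:
            self.tables.move_to_end(key)  # mark as most recently used
            return False
        while self.tables and sum(self.tables.values()) + size_mb > self.capacity_mb:
            self.tables.popitem(last=False)  # drop the least recently used table
        self.tables[key] = size_mb
        return True
```

Because only complete tables are loaded and dropped, a scheduler holding a copy of this state knows exactly which tables a server can access without additional loading costs.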

For the workload, we distinguish the two states idle and busy, i.e. we assign
one query to one server at a time. This is not just a simplification to
facilitate the scheduling but a necessity in main-memory databases, where
cache awareness and concurrent memory access are of distinctly higher
importance than in I/O-dominated database models [MBK00]. We model the
workload as a function $J(S)$ which returns the expected time of job
completion at server S given its current workload, i.e. the expected time from
now until S becomes idle. If the server is idle, $J(S)$ evaluates to 0. J is
used in the scheduler to find the node that will finish its job next. J is
computed by conventional cost formulae known from sequential query processing:
given the time $x_0$ at which the currently running query was assigned to
server S, the time $x$ at which J is evaluated, and the expected running time
$e$ of the query, the time of job completion computes to
$J(S) = x_0 + e - x$. See also Section 7 for a discussion of the accuracy of J.

5.2 Distance Metric 

We define the server-query distance as the costs to load the tables for a
given query Q on a server S:

$$d(Q, S) = \sum_i R(t_i, a_i) \cdot l_i$$

where

$$R(t_i, a_i) = \begin{cases} 0, & \text{if } t_i \text{ is memory-resident at } S \text{ and hashed by attribute } a_i \\ 1, & \text{else} \end{cases}$$

indicates whether table $t_i$ is resident in memory at S. In case $t_i$ is
required as a hash table, R also checks whether the table is hashed by
attribute $a_i$.

Scheduling a batch of queries optimally on k servers means finding a division
into k batches $B_1, \dots, B_k$, each of which is executed sequentially on
one server, such that the running time of the batch with the longest
completion time is minimal.
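
The distance metric translates directly into code. In the following Python sketch a query is a list of (table, hash attribute, loading cost, operator cost) quadruples and the memory state of a server is a set of (table, hash attribute) pairs; this representation and the function names are assumptions made for illustration.

```python
def residency_indicator(table, hash_attr, resident):
    """R(t, a): 0 if the table is resident (and hashed as required), else 1."""
    return 0 if (table, hash_attr) in resident else 1


def distance(query, resident):
    """Server-query distance d(Q, S): loading costs of all non-resident tables."""
    return sum(
        residency_indicator(table, hash_attr, resident) * load_cost
        for (table, hash_attr, load_cost, _op_cost) in query
    )
```

For the hash join example of Section 4, a server that already holds A hashed by a has a distance of 0.9, while a server holding A without the hash table is as far away as a cold server (3.6).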



5.3 Scheduling Algorithm 

We use the distance measure to develop a greedy scheduling algorithm that
establishes an acceptable trade-off between workload-focused and
memory-focused scheduling. Figure 2 shows an outline of our algorithm, called
Memory Aware Scheduling (MAS). It iterates over the queue of arriving queries,
selecting one at a time, and determines the best ad-hoc assignment.

In detail, we examine the first n elements of the queue, or fewer if the queue
does not contain n queries. We investigate the impact of n and suitable values
for it in the next section. For each of the n queries and each server $S_i$,
we compute c, which consists of the distance to $S_i$ plus the operator cost
of the query and the expected time at which server $S_i$ becomes available,
$J(S_i)$. We record the pair (Q, S) with the lowest value for c. After
examination of all n queries, we assign Q to S, which means that Q will be
executed on S as soon as this server becomes idle.

The algorithm is in $O(N \cdot n \cdot s)$ where N is the total number of
queries, n is the number of queries considered in each run, and s is the
number of servers available, i.e. the algorithm is linear in the number of
queries. Giving any meaningful bounds on the performance is particularly
difficult because of the LRU replacement of tables.

5.4 Deadlines 

In order to give the user a guarantee of service, we tag every query with a 
deadline. This deadline refers to the latest point in time the query has to be 
assigned to a server for execution, i.e. as soon as the deadline of a query expires, 
the scheduler has no other choice than assigning this very query to a server. 

We tag all queries with a time stamp according to their arrival, so that
deadlines expire in the order of arrival. In other words, queries are not
forced by deadlines to overtake others, though overtaking is often beneficial.
As a result, we need only check the first query of the current top-n batch for
deadline expiration. If the first query's deadline has expired, we do not need
to examine any other query in the batch but have to assign the first
immediately to a server. Otherwise, if the first query's deadline has not yet
expired, no other deadline can be due. Testing for expiration after the first
query has been checked against all servers ensures the best assignment in case
the deadline has expired.

Algorithm MAS
  while queue not empty do
    c_min ← ∞
    foreach query Q' in top_n(queue) do
      for i = 1 to number of servers do
        c = d(Q', S_i) + W(Q') + J(S_i)
        if c < c_min do
          Q ← Q'
          S ← S_i
          c_min ← c
        done
      done
      if expired(Q') then break
    done
    assign query Q to server S
    remove Q from queue
  done

Fig. 2. Scheduling Algorithm MAS
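
One selection step of the algorithm in Fig. 2 can be sketched in Python as follows. This is a hedged illustration, not the original implementation: the data layout (queries as lists of quadruples, servers as dictionaries with a `resident` set and a `busy_for` value standing in for J(S)) and the `is_expired` callback are assumptions introduced here.

```python
import math


def mas_select(queue, servers, lookahead, is_expired):
    """Pick the (query, server) pair with the lowest cost among the first queries.

    queue:      list of queries, each a list of
                (table, hash_attr, load_cost, op_cost) quadruples
    servers:    list of dicts with keys "resident" (set of (table, hash_attr))
                and "busy_for" (expected time until the server is idle)
    lookahead:  number n of queue entries to examine
    is_expired: callable(query) -> bool, deadline test
    """
    best_cost, best_query, best_server = math.inf, None, None
    for query in queue[:lookahead]:
        op_cost = sum(w for (_t, _a, _l, w) in query)
        for idx, server in enumerate(servers):
            dist = sum(l for (t, a, l, _w) in query
                       if (t, a) not in server["resident"])
            cost = dist + op_cost + server["busy_for"]
            if cost < best_cost:
                best_cost, best_query, best_server = cost, query, idx
        if is_expired(query):
            break  # an expired query must be assigned without further look-ahead
    return best_query, best_server, best_cost
```

A driver loop would call `mas_select` repeatedly, remove the chosen query from the queue, and update the chosen server's memory state and workload before the next call.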

6 Experimental Results 

In this section, we describe experimental results obtained with a simulator.
We chose to simulate the system in order to experiment with parameters that
are strictly limited by an actual hardware configuration, such as the size of
the database cluster or the main memory available at the individual servers.

6.1 Preliminaries 

Since the full multi-media Acoi demonstrator database is still under
construction, we had to confine the experiments to the part already
operational. The part chosen is the index system of the ACM Anthology, which
is included in the demonstrator to assess its capabilities in the area of
XML-based database processing. We used statistical data available from an
actual Monet database instance which contains the complete XML code of the
Anthology decomposed into the vertically fragmented data model [SKWW00].
Including indexes, the database contains 376 tables with up to nearly 60,000
rows. The numbers of tables according to their sizes are given in Table 1.
Typical queries use some 5 tables or fewer, seldom up to 10 or more. We
generated batches of 10,000 and 100,000 queries of exponentially distributed
sizes accessing 5 tables on average.

Table 1. Sizes and numbers of base tables

Size      < 20MB   < 30MB   < 40MB   < 50MB   > 50MB
#Tables   233      87       24       15       17

We do not consider mechanisms to reject user queries due to overload since 
this can be done already at the front-end level. Modeling arrival rates probabilis- 
tically is not necessary as queries do not significantly differ in running time, thus 
information about the queries does not need to be considered to make a choice 
which queries to accept and which to reject. For our experiments, we assume 
the maximal expected query arrival rate to model the worst case behavior with 
maximum load. Moreover, we assume all hosts within the database cluster are 
of identical configuration regarding the cost relevant parameters. 

Since the replacement of tables according to an LRU strategy makes analytical
modeling of the algorithm hard, we implemented Graham's list scheduling (GLS),
which is based on workload figures [Gra69], for comparison. We chose Graham's
algorithm instead of seemingly more advanced techniques, as it is the only
algorithm that does not make assumptions concerning parameters like network
load etc., which are impossible to maintain accurately at reasonable cost
(see also Sections 2 and 7).

We adapted the original algorithm to fit the online arrival of queries, i.e. to 
use the kind of look-ahead we introduced in MAS above. Graham’s algorithm is 
known to be highly effective despite its simplicity overcoming some characteristic 
disturbances also known as scheduling anomalies. These anomalies occur with 
jobs that differ significantly in completion time. Please note, we run Graham’s 
algorithm in exactly the same setting, i.e. with LRU replacement of tables as 
necessary. As a result, Graham’s algorithm also profits from re-use of memory 
resident data. 
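
For reference, classic list scheduling in Graham's sense can be sketched in a few lines of Python. This shows only the basic rule (the next job in the list goes to the server that becomes idle first) with known job durations; the look-ahead adaptation and the LRU table replacement used in our setting are not modeled, and the function name is our own.

```python
import heapq


def list_schedule(job_times, servers):
    """Assign jobs in list order to whichever server becomes idle first.

    Returns the server index chosen for each job and the overall makespan.
    """
    idle_at = [(0.0, s) for s in range(servers)]  # (time server becomes idle, id)
    heapq.heapify(idle_at)
    assignment = []
    for duration in job_times:
        free_time, server = heapq.heappop(idle_at)
        assignment.append(server)
        heapq.heappush(idle_at, (free_time + duration, server))
    makespan = max(t for t, _ in idle_at)
    return assignment, makespan
```

The rule looks only at workload, which is why it serves as a memory-unaware baseline in the comparison.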

The main parameters we want to investigate are, foremost, the number of
servers, the look-ahead during the scheduling, and the amount of memory
available at each server. Note that there are two principal ways of
comparison: a one-on-one comparison of both algorithms run on identical
configurations of the platform (relative performance), or a comparison of the
scaling characteristics. Typical examples for the latter are speedup and
scale-up. We will compare the two algorithms using both principles where
adequate.

6.2 Warm/Cold Processing 

In this first experiment we investigate the differences between warm and cold
servers. We refer to a server as cold if no tables other than system tables
have been loaded. If more than 80% of the available memory is allocated, we
call a server warm.² This is relevant not only for later experiments but also
for the real application scenario, when the system has to be shut down and
restarted for technical reasons such as maintenance. We can expect different
effects if the amount of memory per server is varied. For small server
configurations the warm-up is completed earlier than for large ones. More
important, however, is the total amount of memory, i.e. the number of servers.
Figure 3 shows the relative execution times of batches of 10,000 queries as a
function of the number of servers. For example, the leftmost data point for
MAS reflects the ratio of execution times for MAS on a cold to that on a warm
server. The left graph shows times for a server configuration of 64MB, the
right for 256MB.

Fig. 3. Relative performance of warm over cold servers

For GLS, cold and warm execution times are almost the same; warm processing is
only up to 5% quicker. For MAS the ratio differs significantly, showing gains
of more than 35%. Especially for the range up to 50 servers, the larger
configuration (right) achieves better performance, i.e. the curve is steeper.
For more servers, the impact of the larger amount of memory decreases: for 100
servers and more, results are virtually equal. We address the issue of memory
sizes in more detail later in this section.

All further results presented in this section are obtained from warm servers. 



6.3 Reordering of Queries 

The next experiment investigates the impact of the look-ahead during schedul- 
ing, i.e. the maximal number of queries that may be re-ordered between each 
assignment. 

Figure 4 shows execution times for a single server (left) and a cluster of 100
servers (right). In both cases the server(s) had 64MB of memory. The execution
time is shown as a function of the look-ahead, scaled to GLS's first data
point, which corresponds to first-come-first-serve (FCFS) scheduling. The size
of the complete query batch was 10,000. In the case of one server, the
look-ahead is of high importance for MAS, and savings can amount to up to 20%
of the execution time. GLS, however, does not profit significantly from
look-ahead.

² We determined the value 80 in preliminary experiments. Execution times on
servers with more than 80% allocated memory did not differ significantly.

Fig. 4. Impact of look-ahead; single server (left) and cluster of 100 servers
(right); x-axis: look-ahead [queries]

In the case of 100 servers, the situation changes. GLS improves relatively more
with increasing look-ahead, whereas for MAS hardly any improvement is
noticeable, though its execution time is substantially below that of GLS. This
is because in a pool of 100 servers many servers offer a very low server-query
distance and the differences between them are often very small. As a
consequence, re-ordering cannot help to find a significantly lower server-query
distance as it did in the sequential case.

In all further experiments we use a look-ahead of 100 queries unless stated 
otherwise. 
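
The role of the look-ahead can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the function names are
hypothetical, and the server-query distance is replaced by a stand-in metric
(the number of tables a server would still have to load), which is not
necessarily the distance used by MAS. A look-ahead of 1 reduces to FCFS.

```python
def server_query_distance(query_tables, resident_tables):
    # Stand-in metric (assumption): tables the server would still have to load.
    return len(set(query_tables) - set(resident_tables))


def next_query(queue, resident_tables, look_ahead):
    """Remove and return the query to assign to an idle server.

    Only the first `look_ahead` queued queries may be re-ordered;
    ties are broken in favour of the query that arrived first.
    """
    window = queue[:max(1, look_ahead)]
    best = min(range(len(window)),
               key=lambda i: server_query_distance(window[i], resident_tables))
    return queue.pop(best)


# Each query is represented by the list of tables it touches.
queue = [["C", "D"], ["A"], ["A", "B"]]
fcfs_choice = next_query(list(queue), {"A", "B"}, look_ahead=1)
reordered_choice = next_query(list(queue), {"A", "B"}, look_ahead=3)
print(fcfs_choice, reordered_choice)
```

With tables A and B resident, FCFS hands out the query on C and D, whereas a
look-ahead of 3 prefers the queued query that only needs A.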



6.4 Speedup and Scale-Up 

The two fundamental measures when investigating the performance of parallel
systems are speedup and scale-up. The former quantifies the gains when scaling
up the platform but keeping the problem size constant, the latter describes the
system's ability to cope with problem sizes growing proportionally with the
platform (cf. e.g. inU^ ). 
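
As a compact reminder, the two measures are commonly written as follows. The
notation is an assumption introduced here for illustration: $T(n, W)$ stands
for the elapsed time of a workload $W$ on $n$ servers.

$$\mathrm{speedup}(n) = \frac{T(1, W)}{T(n, W)}, \qquad \mathrm{scaleup}(n) = \frac{T(1, W)}{T(n, n \cdot W)}$$

Under this reading, linear speedup corresponds to $\mathrm{speedup}(n) = n$ and
ideal scale-up to $\mathrm{scaleup}(n) = 1$.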

Figure 5 shows the speedup for a query batch of 100,000 queries evaluated
on up to 4096 servers with 64MB memory each. As the plot shows, GLS achieves
slightly sub-linear speedup whereas MAS achieves even super-linear speedup,
which results from the effective re-use of memory-resident data.

Fig. 5. Speedup

Fig. 6. Scaleup

To better illustrate this phenomenon, consider a very basic example using 4
tables A, B, C and D of the same size such that only two tables fit into memory
at the same time, and a batch of 4 queries:

    Q1 = ((A,2,2),(B,2,2))
    Q2 = ((C,2,2),(D,2,2))
    Q3 = ((D,2,2))
    Q4 = ((A,2,2))

On a single machine, the total costs amount to 8 + 8 + 4 + 4 = 24. With linear
speedup, we would expect 12 cost units for a parallelization on two hosts. MAS
assigns queries Q1 and Q4 to one machine, and Q2 and Q3 to the other, which
amounts to 8 + 2 = 10 cost units at each node. Hence, the total execution time
is 10 compared to 24 on a single node, which gives a speedup of 2.4.
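
The arithmetic of this example can be replayed with a small Python sketch. It
rests on two assumptions that are not spelled out above: the two numbers in
each triple are read as a load cost and a processing cost of 2 units each, and
the single-machine baseline of 24 charges every table access its full cost.
The names used below are illustrative only.

```python
LOAD, PROCESS = 2, 2  # assumed meaning of the two numbers in each (table, 2, 2)

QUERIES = {"Q1": ["A", "B"], "Q2": ["C", "D"], "Q3": ["D"], "Q4": ["A"]}


def cold_cost(query):
    """Cost of a query when none of its tables is memory resident."""
    return sum(LOAD + PROCESS for _ in QUERIES[query])


def node_cost(sequence, capacity=2):
    """Cost of running the queries in order on one node that keeps at most
    `capacity` tables in memory (least recently used table is evicted)."""
    resident = []  # least recently used table first
    total = 0
    for query in sequence:
        for table in QUERIES[query]:
            if table in resident:
                total += PROCESS          # re-use: no load needed
                resident.remove(table)
            else:
                total += LOAD + PROCESS   # table has to be loaded first
                if len(resident) == capacity:
                    resident.pop(0)
            resident.append(table)
    return total


single = sum(cold_cost(q) for q in QUERIES)                        # 8 + 8 + 4 + 4
parallel = max(node_cost(["Q1", "Q4"]), node_cost(["Q2", "Q3"]))   # 8 + 2 per node
speedup = single / parallel
print(single, parallel, speedup)
```

Under these assumptions the sketch reproduces the figures of the example: 24
cost units on one machine, 10 on each of the two nodes, and a speedup of 2.4.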

Figure 6 shows the scale-up for the same server pool configuration. MAS
maintains a scale-up of about 1.1 even for large configurations whereas GLS
drops quickly to about 0.8.



6.5 Direct Comparison 

In our last experiment, we compare the algorithms directly, i.e. we determine the
ratio of MAS's execution time to that of GLS for each individual parameter
setting. Values below 1.0 indicate that MAS outperforms GLS. As parameters of
the experiment we vary the number of servers, the memory available, and the
look-ahead.

Figure 7 shows the relative execution time as a function of the number of
servers and the amount of memory per server. The number of servers varies
between 5 and 100, memory between 32 and 640MB. As the diagram displays,
MAS outperformed GLS in all 400 individual experiments, achieving execution
times as short as 40% of those of GLS. However, as the diagram reveals, the
advantage does not grow monotonically: with increasing memory sizes, GLS
manages to "catch up", though only to a certain degree. See for example the
front row where, after MAS has increased its lead (0.4 at ca. 160MB), it cannot
further improve its running time whereas GLS keeps improving. Beyond
ca. 320MB neither algorithm achieves any further improvement, hence the
plateaux.

Fig. 7. Relative performance; memory available and number of servers varied

Fig. 8. Relative performance; look-ahead and number of servers varied



Figure 8 shows the relative performance as a function of the number of servers
and the look-ahead. All servers had 64MB of memory. The plot confirms the
earlier findings, now also in the direct comparison: for a small number of
servers look-ahead plays an important role (see last row), which fades
as the number of servers increases (see front row).



7 Discussion 

While implementing and testing different versions of MAS we made several design
decisions which are not self-evident at first sight and deserve a more detailed
discussion.

The most important question in the design process was whether or not to include
more system information about the servers. Typically, cost models in
parallel databases try to model the system, particularly the shared resources
like network or shared disks, in as much detail as possible. We chose not to
incorporate this kind of information for three reasons. Firstly, the type of
query we are dealing with cannot profit much from the kind of parallel
processing these cost models have been developed for. Secondly, a cost
computation that takes details at this fine granularity into account is
computationally too expensive to deliver cost estimates for hundreds of servers
when an online arrival of queries needs to be scheduled in real time. Lastly,
this detailed system information is hard to maintain. Keeping it accurately
up-to-date during processing would require a substantial share of both network
and processing resources and is thus infeasible.

In contrast to that the cost values used by MAS are of simpler nature. The 
cost estimates for the query execution time (see Section IFTTIl can be assembled 
from sequential cost estimates. Since the queries are emitted by a fixed interface,
there is the possibility to pre-compile queries which are then instantiated with a
few parameters. In that case highly accurate cost estimates can be pre-computed
and, like the queries, instantiated with parameters.

The information necessary to describe the system for the scheduling is easy to
maintain, as it consists only of a feedback message indicating that processing
of the last query has terminated, i.e. the result has been shipped, and which
tables are now in main memory. To avoid idleness between feedback and new
assignment, servers can be extended to buffer one query while processing the
previously assigned one. The feedback, together with the fact that servers do
not maintain their own query queues, ensures that the impact of a few inaccurate
cost estimates, which can never be completely avoided, is kept to a minimum.

Finally, the query execution we proposed does not allow for multi-programming,
i.e. running several queries concurrently on one server. This choice
was motivated by two facts. Firstly, multi-programming is not very beneficial in
main-memory databases like Monet as concurrently processed queries often get
in each other's way, unlike in I/O dominated database systems. Secondly, and
more importantly as this holds for other database back-ends too, the potential
gains of multi-programming, where the I/O of one query and the CPU intensive
computation of another one can be overlapped, are very limited, as it is our
foremost goal to avoid costly I/O operations altogether.

8 Conclusion 

Web-enabled databases and database back-end technology for large web-based
information systems are one of the fastest growing segments of the database
market. Those systems challenge the traditional repertoire of optimization
techniques used in database technology. Powerful user interfaces for multi-media
object retrieval concurrently submit large numbers of queries to the database
back-end, which makes throughput optimization a primary optimization goal.

In this paper, we proposed a parallel query processing architecture and
investigated the possibility of exploiting data sharing by clever scheduling of
the arriving queries. We have developed MAS, a scheduling strategy that tries to
maximize the re-use of data resident in main memory across the database cluster.
The algorithm is distinguished by its simplicity and robustness on the one hand
(the information needed to make MAS work is easy to obtain, accurate, and needs
only little effort to be kept up-to-date) and its effectiveness on the other.

Our experiments show superior results compared to conventional list scheduling
in terms of both query throughput and scaling behavior, confirming our
considerations. Moreover, by re-using the data available rather than assigning
data actively, the scheduling algorithm adapts to changing hotspots.

Our future work is geared toward extending the scheduling scheme to consider
intermediate results and exploit similarities among queries.

Acknowledgements. Thanks are due to Albrecht Schmidt who provided us
with a Monet database instance of the ACM Anthology.

Requirements of an OQL Query

A. Trigoni and G.M. Bierman 

University of Cambridge Computer Laboratory, UK 



Abstract. In this paper, we present an inference algorithm for OQL 
which both identifies the most general type of a query in the absence of 
schema type information, and derives the minimum type requirements a 
schema should satisfy to be compatible with this query. Our algorithm 
is useful in any database application where heterogeneity is encoun- 
tered, for example, schema evolution, queries addressed against multiple 
schemata, inter-operation or reconciliation of heterogeneous schemata. 
Our inference algorithm is technically interesting as it concerns an ob- 
ject functional language with a rich semantics and complex type system. 
More precisely, we have devised a set of constraints and an algorithm 
to resolve them. Our resulting type inference system for OQL should be 
useful in any open distributed, or even semi-structured, database envi- 
ronment. 



1 Introduction 

The ODMG Standard 0 (hereafter referred to as simply the Standard) presents, 
rather informally, some details of a type system for checking OQL queries using 
type information about the classes, extents, named objects and query definitions 
from a given database schema. Recently there have been some efforts to formalise 
this type system m- This paper builds on our earlier work 0 and considers 
the problem of inferring the most general type of an OQL query in the absence 
of any schema information. 

For example, consider the following OQL definition and query: 

    define Dept_Managers(dept) as
        select e
        from Employees as e
        where e.position="manager" and e.department=dept;
    select d
    from Departments as d
    where count(Dept_Managers(d)) > 5

This query yields those departments that have more than five managers. It is
interesting to notice that this information could be obtained by running the
query against databases with significantly different schemata. For instance,
consider schema A, which has two classes, Employee and Department, defined as
follows.

B. Read (Ed.): BNCOD 2001, LNCS 2097, pp. I8.=j- E77T1 2001. 

@ Springer- Verlag Berlin Heidelberg 2001 






    class Employee (extent Employees)
    { attribute string name;
      attribute string position;
      attribute int year_of_birth;
      attribute float salary;
      attribute Department department;}

    class Department (extent Departments)
    { attribute string id;}

On the other hand, consider a second schema B, which has a class Employee 
and a named collection object Departments of type List(int). 

    class Employee (extent Employees)
    { attribute string name;
      attribute string position;
      attribute int department;}

The query could potentially run against both A and B without causing any
type errors. In the case of schema A, the result of the query would be a bag of
Department objects. In a database with schema B, the result of the query would
be a bag of integers. Two vital questions arise at this point. First, how can we
draw limits, or put restrictions, on the properties of a schema, so that a certain
query is well-typed with respect to it? Second, what information can we derive
about the type of the result of the query, supposing that we have no specific
schema in mind? In this paper, we study these two questions in detail, but first
let us consider the setting where this could be important.

For example, this information could be exploited in distributed database
applications. Suppose we have time critical queries addressed against multiple
schemata. If frequent updates on parts of these schemata are likely to occur,
then many of the queries will inevitably fail to be executed. In order to avoid
this situation, we should register interest in specific updates of each schema
(at least in those that would affect the critical queries) and resolve the type
incompatibility in due course and not at the time the queries get executed.

Our work is equally useful in contexts where we need to achieve inter- 
operation between heterogeneous sources. There has been a lot of research on 
reconciling schemata with semantic heterogeneity m One approach to this 
problem identifies the semantic inconsistencies of the ontologies in different do- 
mains and creates a global ontology that combines all of them. Another approach 
identifies the intersection of domains where the inconsistencies occur and tries to 
resolve them by introducing matching rules between them. In both cases, queries 
that are initially written to be executed on one domain need to be rephrased 
to fit the needs of more domains. Knowing the schema requirements of a query 
and the schema mappings to a (global or just different) ontology, the task of 
rephrasing queries becomes a trivial automatic process. Suppose that a group 
of airline companies cooperate to create a single uniform system for booking 
tickets. In order to do that they define a global ontology that is very close to 
each of the distinct ontologies. Each query is initially phrased to conform to the 
global ontology and is then transformed to appropriate queries addressed to the 
individual schemata. The transformation is much easier to perform if besides the
schema mappings (from one ontology to the other), we are aware of the query
schema requirements. The latter effectively point out the exact mappings we
need to use.

This paper is organised as follows. In section 2 we recall our earlier
definition of a core OQL, a fragment of the language defined in the Standard
which nevertheless has the same expressive power. We give a brief overview of
the type system of OQL, including the notion of subtyping. In section 3 we study
the type system of our inference model, introducing a new relation between
types, called more specific. In section 4 we describe the kinds of constraints
generated by our type inference algorithm. The subsequent sections present an
algorithm for resolving these constraints, the inference rules that form the
core of our inference system, and finally the inference algorithm which yields
the most general type of a query along with its schema requirements.



2 Core OQL 

In this section we fix the syntax and type system for OQL. This is explained in
greater detail in an earlier paper; space restrictions mean that here we simply
give the syntax for queries and definitions^1 in Figure 1. An OQL program
consists of a number (maybe zero) of named definitions followed by a query.

The syntax for OQL types is also given in Figure 1. In what follows we will
write Col($\sigma$) to denote an arbitrary collection type (set, bag, list or
array) with elements of type $\sigma$.

Implicit in the ODMG model is a notion of subtyping; the underlying idea
is that $\sigma$ is said to be a subtype of $\tau$ if a value of type $\sigma$
can be used in any context in which a value of type $\tau$ is expected. This we
shall write $\sigma \le \tau$ and define as the least relation closed under the
rules given in Figure 1.

Single inheritance between two classes, referred to in the Standard as the
"derives from" relation, appears as the premise of rule Sub-Class. To simplify
our presentation we do not consider interfaces.

An interesting feature of our subtype relation is the treatment of structures.
A type $\sigma = \mathsf{struct}(l_1{:}\ \sigma_1, \ldots, l_k{:}\ \sigma_k)$ is
considered to be a subtype of
$\tau = \mathsf{struct}(l_1{:}\ \tau_1, \ldots, l_n{:}\ \tau_n)$ if $\tau$ is
obtained from $\sigma$ by dropping some labels. (In fact, we generalise this a
little and also allow subtyping between the label types.) This so-called
width-subtyping is an extension to the Standard, but we feel it
offers considerable flexibility.
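
A minimal Python sketch of this width-subtyping check follows. The encoding is
an assumption made for illustration only: base types are strings, a collection
is a pair such as ("set", element_type), and a structure is a pair
("struct", {label: type}). Classes and function types are left out.

```python
def is_subtype(sub, sup):
    """Return True if `sub` may be used where `sup` is expected."""
    if sub == sup:
        return True  # reflexivity
    if isinstance(sub, tuple) and isinstance(sup, tuple) and sub[0] == sup[0]:
        if sub[0] == "struct":
            sub_fields, sup_fields = sub[1], sup[1]
            # Width subtyping: every label required by `sup` must be present
            # in `sub`, with a label type that is itself a subtype.
            return all(
                label in sub_fields
                and is_subtype(sub_fields[label], sup_fields[label])
                for label in sup_fields
            )
        # Collections are covariant in their element type.
        return is_subtype(sub[1], sup[1])
    return False


employee = ("struct", {"name": "string", "salary": "float"})
named = ("struct", {"name": "string"})
print(is_subtype(employee, named))                    # extra label is dropped
print(is_subtype(named, employee))                    # salary is missing
print(is_subtype(("set", employee), ("set", named)))  # lifted to collections
```

A structure with the labels name and salary can thus stand in wherever only a
name is required, but not the other way round.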

The type system and the subtype relation are given in detail in an earlier
paper. In that work, we aimed at deriving the type of an OQL query given
specific schema information. In order to do that, we defined typing judgements
of the form:

$$\mathcal{S};\ \mathcal{D};\ \mathcal{N};\ Q \vdash q : \sigma$$

where $\mathcal{S}$ are the class definitions, $\mathcal{D}$ are the persistent
query definitions and $\mathcal{N}$ are the named objects of a specific schema.
$Q$ represents the query typing environment, i.e. it contains the types of any
free identifiers in $q$. A simple example of the typing rules used to derive the
type of a query is the following:

$$\frac{\mathcal{S};\ \mathcal{D};\ \mathcal{N};\ Q \vdash q : \mathsf{list}(\sigma)}{\mathcal{S};\ \mathcal{D};\ \mathcal{N};\ Q \vdash \mathsf{first}(q) : \sigma}\ \text{(First-list)}$$

In the current context, we have no information about the classes, the query
definitions, and the named objects, as we have no schema information. The
problem we address can thus be written $? \vdash q : ?$, i.e. given an arbitrary
query $q$, can we infer its type and the type of any supporting schemata?

^1 Naturally, as we are interested in inferring types, we drop the requirement
that definition parameters be explicitly typed.

    Queries       q ::= b | f | i | c | s
                      | x
                      | bag(q, ..., q) | set(q, ..., q) | list(q, ..., q) | array(q, ..., q)
                      | struct(l: q, ..., l: q)
                      | C(l: q, ..., l: q) | q.l | (C)q
                      | q[q] | q in q | q() | q(q, ..., q)
                      | forall x in q: q | exists x in q: q
                      | q binop q | unop(q)
                      | select [distinct] q
                        from (q as x, ..., q as x)
                        where q
                        [group by (l: q, ..., l: q)]
                        [having q]
                        [order by (q asc|desc, ..., q asc|desc)]

    Definitions   d ::= define x as q
                      | define x(x, ..., x) as q

Here b, f, i, c, s range over booleans, floats, integers, characters and strings
respectively, x is taken from a countable set of identifiers, l is taken from a
countable set of labels, and C ranges over a countable set of class names. We
assume sets of unary and binary operators, ranged over by unop and binop
respectively.

$$\begin{array}{rcl}
\text{Types}\quad \sigma & ::= & \mathsf{int} \mid \mathsf{float} \mid \mathsf{bool} \mid \mathsf{char} \mid \mathsf{string} \mid \mathsf{void} \\
 & \mid & \sigma \times \cdots \times \sigma \to \sigma \\
 & \mid & \mathsf{bag}(\sigma) \mid \mathsf{set}(\sigma) \mid \mathsf{list}(\sigma) \mid \mathsf{array}(\sigma) \\
 & \mid & \mathsf{struct}(l{:}\ \sigma, \ldots, l{:}\ \sigma) \\
 & \mid & C
\end{array}$$

We assume a distinguished class name Object.

Subtyping

$$C \le \mathsf{Object}\ \text{(Top)} \qquad \frac{C \text{ derives from } C'}{C \le C'}\ \text{(Sub-Class)}$$

$$\frac{\sigma'_1 \le \sigma_1 \quad \cdots \quad \sigma'_k \le \sigma_k \quad \tau \le \tau'}{\sigma_1 \times \cdots \times \sigma_k \to \tau \ \le\ \sigma'_1 \times \cdots \times \sigma'_k \to \tau'}\ \text{(Sub-Fun)} \qquad \frac{\sigma \le \tau}{\mathsf{Col}(\sigma) \le \mathsf{Col}(\tau)}\ \text{(Sub-Coll)}$$

$$\frac{\sigma_1 \le \tau_1 \quad \cdots \quad \sigma_k \le \tau_k}{\mathsf{struct}(l_1{:}\ \sigma_1, \ldots, l_k{:}\ \sigma_k, \ldots, l_{k+n}{:}\ \sigma_{k+n}) \ \le\ \mathsf{struct}(l_1{:}\ \tau_1, \ldots, l_k{:}\ \tau_k)}\ \text{(Sub-Struct)}$$

$$\sigma \le \sigma\ \text{(Sub-Refl)} \qquad \frac{\sigma \le \sigma' \quad \sigma' \le \sigma''}{\sigma \le \sigma''}\ \text{(Sub-Trans)}$$

Fig. 1. Syntax, Types and Subtyping for Core OQL



3 Type and Schema Inference 

In this section, we present the extended type system behind our inference
algorithm and a new relation between types called the Generalisation-
Specialisation relation. We also discuss the notions of Least Upper Bound and
Greatest Lower Bound of two types, which occur frequently in our inference
algorithm.



3.1 Extended Type System 

It turns out to be convenient to extend the notion of type given in Figure 1;
an example should make this clear. Consider the following:

    define q1 as select x from Students as x;
    define q2 as set(first(Students));
    q1 union q2

Considering the query q1 union q2 first, we can infer immediately that q1 and
q2 should be either sets or bags of elements. To represent this we introduce
a new type constructor set/bag(-). Moreover, the elements of this collection
cannot be of a function type, since the Standard does not allow functions to
be members of a collection; thus we introduce the types nonfunctional and
function. From the definition q1 we infer that Students is some collection
(set, bag, list or array) of elements of any (non-functional) type. Considering
the definition q2 we can infer further information about Students. As it is the
argument of a first operation it must be an ordered collection, i.e. a list or an
array. Again we introduce a new type constructor list/array(-). In summary,
the algorithm should infer that Students is a list or an array of a nonfunctional
member type $\tau$, and that the query q1 union q2 is of type bag($\tau$).^2

The above example motivates our need to extend the initial specific types
(given in Figure 1) with the following so-called general types.

$$\begin{array}{l}
\{\mathsf{any},\ \mathsf{nonfunctional},\ \mathsf{atomic},\ \mathsf{orderable},\ \mathsf{int/float}\}\ \cup \\
\{\mathsf{collection}(\tau),\ \mathsf{set/bag}(\tau),\ \mathsf{list/array}(\tau),\ \mathsf{constructor}(l_i{:}\ \tau_i)\}\ \cup \\
\{\text{all types from the core type system with at least one component type being a general type, e.g. } \mathsf{set}(\mathsf{any})\}
\end{array}$$

where $\tau$, $\tau_i$ are specific or general types.

^2 The Standard [§4.10.11] states that merging a set and a bag results in a bag.

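Given these general types the resulting type system is then as follows:

$$\begin{array}{rcl}
\sigma & ::= & \mathsf{int} \mid \mathsf{float} \mid \mathsf{bool} \mid \mathsf{char} \mid \mathsf{string} \mid \mathsf{void} \\
 & \mid & \sigma \times \cdots \times \sigma \to \sigma \\
 & \mid & \mathsf{bag}(\sigma) \mid \mathsf{set}(\sigma) \mid \mathsf{list}(\sigma) \mid \mathsf{array}(\sigma) \mid \mathsf{struct}(l{:}\ \sigma, \ldots, l{:}\ \sigma) \\
 & \mid & C \\
 & \mid & \mathsf{any} \mid \mathsf{nonfunctional} \\
 & \mid & \mathsf{atomic} \mid \mathsf{orderable} \mid \mathsf{int/float} \\
 & \mid & \mathsf{constructor}(l{:}\ \sigma, \ldots, l{:}\ \sigma) \\
 & \mid & \mathsf{collection}(\sigma) \mid \mathsf{set/bag}(\sigma) \mid \mathsf{list/array}(\sigma)
\end{array} \qquad (1)$$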

This extended type system (1) is coupled with the type hierarchy illustrated
in Figure 2. It is worth noting that the general types, which are the internal
nodes of the tree, are not types that can be found in a database schema, but
rather abstractions or families of types that encapsulate the common features of
their children.




[Figure 2: tree diagram of the type hierarchy. Node labels that survive in the
source include class, struct, set/bag, list/array, set, bag, list, array, char,
int and float.]

Fig. 2. Type Hierarchy

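A type is said to be specific if it can be derived by the type system given
in Figure 1. Otherwise, it is said to be general. All the leaves of the hierarchy
tree (in Figure 2) are specific types, if they are nullary (non-parametric) types
(int, float, char, string, bool) or if they are parametric types with all the
parameter types being specific types (e.g. set(int)).

Given this more general type system, we need to extend our notion of subtyping
given earlier. We define a new relation, GSR (Generalisation-Specialisation
Relationship). Given types $\sigma$, $\tau$, we write $\sigma \sqsubseteq \tau$
to express that $\sigma$ is more specific than $\tau$.

$$\frac{\sigma \le \tau}{\sigma \sqsubseteq \tau}\ \text{(GSR-Type)} \qquad \frac{\sigma'_1 \sqsubseteq \sigma_1 \quad \cdots \quad \sigma'_k \sqsubseteq \sigma_k \quad \tau \sqsubseteq \tau'}{\sigma_1 \times \cdots \times \sigma_k \to \tau \ \sqsubseteq\ \sigma'_1 \times \cdots \times \sigma'_k \to \tau'}\ \text{(GSR-Fun)}$$

$$\frac{\mathit{coll}_1 \text{ is child of } \mathit{coll}_2 \quad \sigma \sqsubseteq \tau}{\mathit{coll}_1(\sigma) \sqsubseteq \mathit{coll}_2(\tau)}\ \text{(GSR-Coll)}$$

where $\mathit{coll}_1$, $\mathit{coll}_2$ are nodes in the sub-tree (Figure 2)
with root collection, and "is child of" signifies that $\mathit{coll}_2$ is a
direct or indirect parent of $\mathit{coll}_1$ or that $\mathit{coll}_1$ and
$\mathit{coll}_2$ are the same.

$$\frac{\mathit{constr}_1 \text{ is child of } \mathit{constr}_2 \quad \sigma_1 \sqsubseteq \tau_1 \quad \cdots \quad \sigma_k \sqsubseteq \tau_k}{\mathit{constr}_1(l_1{:}\ \sigma_1, \ldots, l_k{:}\ \sigma_k, \ldots, l_{k+n}{:}\ \sigma_{k+n}) \ \sqsubseteq\ \mathit{constr}_2(l_1{:}\ \tau_1, \ldots, l_k{:}\ \tau_k)}\ \text{(GSR-Constr)}$$

where $\mathit{constr}_1$, $\mathit{constr}_2$ are nodes in the sub-tree
(Figure 2) with root constructor, and "is child of" signifies that
$\mathit{constr}_2$ is a direct parent of $\mathit{constr}_1$ or that
$\mathit{constr}_1$ and $\mathit{constr}_2$ are the same.

$$\frac{\mathit{atom}_1 \text{ is child of } \mathit{atom}_2}{\mathit{atom}_1 \sqsubseteq \mathit{atom}_2}\ \text{(GSR-Atomic)}$$

where $\mathit{atom}_1$ and $\mathit{atom}_2$ are nodes in the sub-tree
(Figure 2) with root atomic, and "is child of" signifies that $\mathit{atom}_2$
is a direct or indirect parent of $\mathit{atom}_1$ or that $\mathit{atom}_1$
and $\mathit{atom}_2$ are the same.

$$\sigma \sqsubseteq \sigma\ \text{(GSR-Refl)} \qquad \frac{\sigma_1 \sqsubseteq \sigma_2 \quad \sigma_2 \sqsubseteq \sigma_3}{\sigma_1 \sqsubseteq \sigma_3}\ \text{(GSR-Trans)}$$

$$\frac{\sigma \neq \tau_1 \to \tau_2}{\sigma \sqsubseteq \mathsf{nonfunctional}}\ \text{(GSR-NonFun)} \qquad \sigma \sqsubseteq \mathsf{any}\ \text{(GSR-All)}$$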

Given this definition, we can define the Greatest Lower Bound (GLB) and
the Least Upper Bound (LUB) of two types $\tau_1$ and $\tau_2$. Insight into
these concepts can be gained through the following simple example. Consider the
types $\tau_1$ = set(atomic) and $\tau_2$ = set/bag(int). The GLB of the two
types is derived by taking the most specific of the collection constructors,
set, and the most specific of the parameter types, int. Thus
GLB($\tau_1, \tau_2$) = set(int). Likewise, for the LUB, we take the most
general of the two collection constructors, set/bag, and the most general of the
two parameter types, atomic. Thus, LUB($\tau_1, \tau_2$) = set/bag(atomic).
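
The example can be traced with a short Python sketch. It is an illustration
under assumptions rather than the algorithm of this paper: a one-level type is
encoded as a pair (constructor, parameter), only the fragment of the hierarchy
needed here is listed, and the chain above int/float is simplified (orderable
is omitted, which does not affect this example).

```python
# Child -> parent edges of the hierarchy fragment used here (assumed shape).
PARENT = {
    "set": "set/bag", "bag": "set/bag",
    "list": "list/array", "array": "list/array",
    "set/bag": "collection", "list/array": "collection",
    "int": "int/float", "float": "int/float",
    "int/float": "atomic",
}


def ancestors(node):
    """Return the node followed by all of its direct and indirect parents."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain


def glb_node(a, b):
    """Most specific of two nodes, one of which must be above the other."""
    if b in ancestors(a):
        return a
    if a in ancestors(b):
        return b
    raise ValueError(f"{a} and {b} are not comparable")


def lub_node(a, b):
    """First common ancestor of two nodes (a node counts as its own ancestor)."""
    for candidate in ancestors(a):
        if candidate in ancestors(b):
            return candidate
    raise ValueError(f"{a} and {b} have no common ancestor")


def glb(t1, t2):
    return (glb_node(t1[0], t2[0]), glb_node(t1[1], t2[1]))


def lub(t1, t2):
    return (lub_node(t1[0], t2[0]), lub_node(t1[1], t2[1]))


t1 = ("set", "atomic")
t2 = ("set/bag", "int")
print(glb(t1, t2))  # ('set', 'int')
print(lub(t1, t2))  # ('set/bag', 'atomic')
```

Both bounds are computed component-wise, which mirrors the informal reading
above: the constructor and the parameter type are handled independently.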

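We may now formally present GLB and LUB. In the following definitions, we
assume that $\mathit{constr}_1$, $\mathit{constr}_2$ are nodes in the sub-tree
(Figure 2) with root constructor and that $\mathit{constr}_1$ is a child of
$\mathit{constr}_2$ or $\mathit{constr}_1 = \mathit{constr}_2$. Moreover,
$\mathit{coll}_1$ and $\mathit{coll}_2$ are nodes in the sub-tree (Figure 2)
with root collection and $\mathit{coll}_1$ is a child of $\mathit{coll}_2$ or
$\mathit{coll}_1 = \mathit{coll}_2$.

$$\mathrm{GLB}(\tau_1, \tau_2) = \tau_1 \quad \text{if } \tau_1 \sqsubseteq \tau_2 \wedge \tau_2 \sqsubseteq \mathsf{atomic}$$

$$\mathrm{LUB}(\tau_1, \tau_2) = \tau \quad \text{if } \tau_1 \sqsubseteq \mathsf{atomic} \wedge \tau_2 \sqsubseteq \mathsf{atomic} \wedge \tau_1 \sqsubseteq \tau \wedge \tau_2 \sqsubseteq \tau\ \wedge$$
$$\qquad (\text{there exists no } \tau' \text{ s.t. } (\tau' \neq \tau \wedge \tau_1 \sqsubseteq \tau' \wedge \tau_2 \sqsubseteq \tau' \wedge \tau' \sqsubseteq \tau))$$

$$\mathrm{GLB}(\mathit{constr}_1(l_1{:}\ \sigma_1, \ldots, l_k{:}\ \sigma_k),\ \mathit{constr}_2(l_1{:}\ \sigma'_1, \ldots, l_k{:}\ \sigma'_k, \ldots, l_{k+n}{:}\ \sigma'_{k+n}))$$
$$\qquad = \mathit{constr}_1(l_1{:}\ \mathrm{GLB}(\sigma_1, \sigma'_1), \ldots, l_k{:}\ \mathrm{GLB}(\sigma_k, \sigma'_k), \ldots, l_{k+n}{:}\ \sigma'_{k+n})$$

$$\mathrm{LUB}(\mathit{constr}_1(l_1{:}\ \sigma_1, \ldots, l_k{:}\ \sigma_k),\ \mathit{constr}_2(l_1{:}\ \sigma'_1, \ldots, l_k{:}\ \sigma'_k, \ldots, l_{k+n}{:}\ \sigma'_{k+n}))$$
$$\qquad = \mathit{constr}_2(l_1{:}\ \mathrm{LUB}(\sigma_1, \sigma'_1), \ldots, l_k{:}\ \mathrm{LUB}(\sigma_k, \sigma'_k))$$

$$\mathrm{GLB}(\mathit{coll}_1(\sigma_1), \mathit{coll}_2(\sigma_2)) = \mathit{coll}_1(\mathrm{GLB}(\sigma_1, \sigma_2))$$

$$\mathrm{LUB}(\mathit{coll}_1(\sigma_1), \mathit{coll}_2(\sigma_2)) = \mathit{coll}_2(\mathrm{LUB}(\sigma_1, \sigma_2))$$

$$\mathrm{GLB}(\tau_1 \times \cdots \times \tau_k \to \sigma,\ \tau'_1 \times \cdots \times \tau'_k \to \sigma') = \mathrm{LUB}(\tau_1, \tau'_1) \times \cdots \times \mathrm{LUB}(\tau_k, \tau'_k) \to \mathrm{GLB}(\sigma, \sigma')$$

$$\mathrm{LUB}(\tau_1 \times \cdots \times \tau_k \to \sigma,\ \tau'_1 \times \cdots \times \tau'_k \to \sigma') = \mathrm{GLB}(\tau_1, \tau'_1) \times \cdots \times \mathrm{GLB}(\tau_k, \tau'_k) \to \mathrm{LUB}(\sigma, \sigma')$$

$$\mathrm{LUB}(\sigma, \tau) = \mathsf{any} \quad \text{if } \sigma \sqsubseteq \mathsf{function} \wedge \tau \sqsubseteq \mathsf{nonfunctional}$$

$$\mathrm{LUB}(\sigma_1, \sigma_2) = \mathsf{nonfunctional} \quad \text{if } \exists \tau_1, \tau_2 .\ \sigma_1 \sqsubseteq \tau_1 \wedge \sigma_2 \sqsubseteq \tau_2 \wedge \tau_1 \neq \tau_2\ \wedge$$
$$\qquad \tau_1, \tau_2 \in \{\mathsf{atomic},\ \mathsf{constructor}(),\ \mathsf{collection}(\mathsf{any})\}$$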



4 Type Compatibility - Constraints 

The inference algorithm we present later in this paper analyses an OQL construct
and infers the most general type of the query and the schema requirements
that should be satisfied so that the query is well-typed. Before being able to
present the inference algorithm, we first discuss an important mechanism that
the algorithm is based upon: the generation of constraints. When the inference
algorithm analyses a certain query construct, it often infers several relations or
associations amongst the types of the query and its subqueries. These
associations are given in the form of constraints.

For example, the analysis of a query $q_1$ union $q_2$ would generate the
constraint that the type $\tau$ of the query is the merge result of the types
$\tau_1$ and $\tau_2$ of the two subqueries, i.e.
$\tau$ = Merge_Result($\tau_1, \tau_2$). Later we show how this
constraint is simplified and is assimilated in the set of the existing constraints.

Analysing the query exists x in Customers: x.income > 40,000 our algorithm
generates the following constraints. First, it introduces the constraint
$\tau_1$ = Member_Type($\tau_2$), where $\tau_1$ is the type of x and $\tau_2$
is the type of Customers. The type of x is expected to be the same as that of
the members of the collection Customers. Second, the constraint
$\tau_3$ = Constructor_Member_Type($\tau_1$, income) is generated, where
$\tau_3$ is the type of x.income. This signifies that x is of a constructor type
(a structure or a class) with at least one member income of type $\tau_3$.
Another interesting constraint is that the types of x.income ($\tau_3$) and of
the literal 40,000 (int) should be compatible in the sense that they can be
compared for inequality. This is expressed by the constraint
Greater_Less_Than_Compatible($\tau_3$, int). Later, we will show how this
constraint is simplified to the constraint $\tau_3 \sqsubseteq$ int/float.

We give a set of inference rules, one for each query construct, in a later
section. Each rule starts with an existing set of constraints and generates a
number (possibly zero) of new constraints. The different kinds of constraints
generated by the rules in our inference algorithm are given below:

    1. Equality_Compatible($\tau_1, \ldots, \tau_n$)
    2. Greater_Less_Than_Compatible($\tau_1, \ldots, \tau_n$)
    3. $\tau = \sigma$
    4. $\tau \sqsubseteq \sigma$
    5. $\tau_0$ = Arith_Result($\tau_1, \tau_2$)
    6. $\tau_0$ = Merge_Result($\tau_1, \tau_2$)
    7. $\tau_0$ = Distinct_Result($\tau_1$)
    8. $\tau$ = Member_Type($\sigma$)
    9. $\tau$ = Constructor_Member_Type($\sigma$, l)
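
The constraints produced for the exists query discussed above can be written
down mechanically. The following Python sketch does so for that single query
shape; the function name, the textual constraint format and the naming of the
fresh type variables (t1, t2, t3) are assumptions made for illustration and do
not reproduce the inference rules of this paper.

```python
from itertools import count


def constraints_for_exists(var, collection, member, literal_type):
    """Constraints for: exists <var> in <collection>: <var>.<member> > <literal>."""
    fresh = (f"t{n}" for n in count(1))  # supply of fresh type variables
    t_var, t_coll, t_member = next(fresh), next(fresh), next(fresh)
    types = {var: t_var, collection: t_coll, f"{var}.{member}": t_member}
    constraints = [
        f"{t_var} = Member_Type({t_coll})",
        f"{t_member} = Constructor_Member_Type({t_var}, {member})",
        f"Greater_Less_Than_Compatible({t_member}, {literal_type})",
    ]
    return types, constraints


types, constraints = constraints_for_exists("x", "Customers", "income", "int")
for constraint in constraints:
    print(constraint)
```

The three printed lines correspond to constraint kinds 8, 9 and 2 in the list
above, with t1, t2 and t3 playing the roles of the types of x, Customers and
x.income.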





We briefly explain the constraints used in our inference model. The constraint
Equality_Compatible($q_1, \ldots, q_n$) is analysed in section 4.1. The
constraint Greater_Less_Than_Compatible($q_1, q_2$) is useful for ensuring
typability for queries like $q_1 < q_2$. Type equality and GSR
(Generalisation-Specialisation Relationship) are handled by constraints 3 and 4.
$\tau_0$ = Arith_Result($\tau_1, \tau_2$) is needed for the type inference of
queries of the form $q_1$ op $q_2$, where op is an arithmetic operator.
Likewise, $\tau_0$ = Merge_Result($\tau_1, \tau_2$) arises as a constraint
from inferring the type of union, intersect or except query expressions.
The constraint $\tau_0$ = Distinct_Result($\tau_1$) implies that both $\tau_0$
and $\tau_1$ are collection types and the collection constructor of $\tau_0$ is
the distinct equivalent of the collection constructor of $\tau_1$. Moreover,
$\tau$ = Member_Type($\sigma$) implies that $\sigma$ is a collection type and
that $\tau$ is the type of its members. Finally, the constraint
$\tau$ = Constructor_Member_Type($\sigma$, l) is used to denote that $\sigma$
is a class or struct type with at least one member l of type $\tau$. If
$\sigma$ is a class type then l can be any of its properties, relationships or
methods.



4.1 Collections - Membership Type Compatibility 

The first two constraints refer to type compatibility w.r.t. equality or
non-equality comparison. These constraints arise in OQL constructs that involve
a merge of two or more elements, or a membership test. For instance, the first
constraint results from considering a query of the form set($q_1, \ldots, q_n$),
which includes the merge of n query results.

First of all, we should stress the fact that in order for two values (objects or
literals) to be eligible as members of the same collection, they should be eligible
for equality comparison. If two types are compatible (membership-wise), two
values of these types may be members of a set. In order to insert an element
into a set, we need to test if its value is equal to any existing value. Thus,
we need to ensure that these values have types which are compatible
(equality-wise). Conversely, if two types are compatible equality-wise, then
their values may be inserted into any collection, therefore these types are also
compatible membership-wise.

The Standard [§4.10] defines recursively when two types are compatible, and
thus when elements of these types can be put in the same collection. The
Standard then defines the notion of least upper bound (LUB) of two types to
derive the type of the collection elements. In a context where we need to check
and derive the type of a query based on specific type information (from a
schema), this approach is sufficient and straightforward. However, in our
context, where we aim to infer the type of a query without any schema
information, the compatibility issue becomes more complicated. The use of LUB to
infer the type of a query like set($q_1, \ldots, q_n$) does not yield the
appropriate result; for example, consider the following.

    define q1 as struct(x: 12, y: 30);
    define q2 as element(select z from People as z where z.x=14);
    set(q1, q2)

If we call the inference algorithm on q1 and q2 the inferred types (IT) would be
struct(x: int, y: int) and constructor(x: int) respectively. The least upper
bound of these two types is constructor(x: int). Thus, the inferred type of the
query set(q1, q2) would be set(constructor(x: int)).

However, the correct inferred type should be set(struct(x: int)), since we
may not merge objects and structures in the same collection, and therefore we
know that the constructor should be a struct and not a class.

To overcome this problem we define another relation between types, namely
CUB (Compatible Upper Bound). Intuitively, CUB combines the behaviour of both
LUB and GLB (Greatest Lower Bound). In the previous example, the CUB of the
two types IT(q1) and IT(q2) would be derived by taking the most specific of the
two constructor types (constructor and struct), but the least general of the
element types ((x: int) and (x: int, y: int)). Before we define CUB, we define
compatibility (Membership- or Equality-wise) for our typing system.
Compatibility is recursively defined as follows:

— $\tau$ is compatible with $\tau$.

— If $\sigma$ is compatible with $\tau$
and $\mathrm{coll}_1, \mathrm{coll}_2 \in$ {collection, set/bag, list/array, set, bag, list, array}
and either the collection constructors are the same or one is a child of the
other in the hierarchy tree, then $\mathrm{coll}_1(\sigma)$ is compatible with $\mathrm{coll}_2(\tau)$.

— Any two class types $\mathrm{class\_name}_1$ and $\mathrm{class\_name}_2$ are compatible.

— If $\sigma_i$ is compatible with $\tau_i$, $\forall i = 1, \ldots, n$,
and $\mathrm{constr}_1, \mathrm{constr}_2 \in$ {constructor, struct}
and no labels other than $l_1, \ldots, l_n$ are common in both constructor types,
then $\mathrm{constr}_1(l_1: \sigma_1, \ldots, l_n: \sigma_n, l_{11}: \sigma_{11}, \ldots, l_{1k}: \sigma_{1k})$ and
$\mathrm{constr}_2(l_1: \tau_1, \ldots, l_n: \tau_n, l_{21}: \tau_{21}, \ldots, l_{2m}: \tau_{2m})$ are compatible.

— If $\sigma_i$ is compatible with $\tau_i$, $\forall i = 1, \ldots, n$, and class_name is a
class type such that Constructor_Member_Type(class_name, $l_i$) = $\tau_i$,
$\forall i = 1, \ldots, n$, then the types class_name and
constructor($l_1: \sigma_1, \ldots, l_n: \sigma_n, l_{11}: \sigma_{11}, \ldots, l_{1k}: \sigma_{1k}$) are compatible,
provided that no labels other than $l_1, \ldots, l_n$ are common members in the two types.

— If $\sigma, \tau \in$ {atomic, orderable, int/float, int, float, char, string, bool}
and either they are the same, or one is a child of the other in the hierarchy
tree, or one is int and the other is float,
then $\sigma$ is compatible with $\tau$.

— If either of the types is nonfunctional and the other type is not a function
type then these types are compatible.

— If either of two types is any, then these types are compatible.

Note that we do not define compatibility for function types, as no two func- 
tion values may be members of the same collection or may be compared for 
equality. Only the results of function application may be considered for compat- 
ibility. 

Given that two types are compatible (based on the recursive definition 
above), their CUB is defined recursively and in accordance with the compatibility 
category that they fall into. 

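— $\mathrm{CUB}(\tau, \tau) = \tau$.

— If $\mathrm{coll}_1(\sigma)$ is compatible with $\mathrm{coll}_2(\tau)$ and $\mathrm{coll}_1$ is a child of (or the same
as) $\mathrm{coll}_2$, then $\mathrm{CUB}(\mathrm{coll}_1(\sigma), \mathrm{coll}_2(\tau)) = \mathrm{coll}_1(\mathrm{CUB}(\sigma, \tau))$.

— If the types $\tau_1$ and $\tau_2$ are class types, then $\mathrm{CUB}(\tau_1, \tau_2)$ is the least common
superclass of the two classes.

— If the types $\sigma = \mathrm{constr}_1(l_1: \sigma_1, \ldots, l_n: \sigma_n, l_{11}: \sigma_{11}, \ldots, l_{1k}: \sigma_{1k})$ and
$\tau = \mathrm{constr}_2(l_1: \tau_1, \ldots, l_n: \tau_n, l_{21}: \tau_{21}, \ldots, l_{2m}: \tau_{2m})$ are compatible,
where $\mathrm{constr}_1, \mathrm{constr}_2 \in$ {constructor, struct}, then
$\mathrm{CUB}(\sigma, \tau) = \mathrm{constr}_1(l_1: \mathrm{CUB}(\sigma_1, \tau_1), \ldots, l_n: \mathrm{CUB}(\sigma_n, \tau_n))$.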

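— If $\sigma = \mathrm{class\_name}_1$ and Constructor_Member_Type($\mathrm{class\_name}_1$, $l_i$) = $\sigma_i$,
$\forall i = 1, \ldots, n$, and $\tau$ = constructor($l_1: \tau_1, \ldots, l_n: \tau_n, l_{21}: \tau_{21}, \ldots, l_{2m}: \tau_{2m}$),
then $\mathrm{CUB}(\sigma, \tau)$ is the least superclass of $\mathrm{class\_name}_1$, say $\mathrm{class\_name}_2$, satisfying
the following condition: for all $l'_j$, if $l'_j$ is a property or a relationship
of $\mathrm{class\_name}_2$ and Constructor_Member_Type($\mathrm{class\_name}_2$, $l'_j$) = $\phi_j$,
then there exists $k$, $1 \le k \le n$, such that $l'_j = l_k \wedge \mathrm{CUB}(\tau_k, \sigma_k) \subseteq \phi_j$,
$\forall j = 1, \ldots, m$, $m \le n$.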

— If $\sigma$ is compatible with $\tau$, then

1. if either of them is int/float or one is int and the other is float then
$\mathrm{CUB}(\sigma, \tau)$ = int/float

2. else
$\mathrm{CUB}(\sigma, \tau) = \mathrm{GLB}(\sigma, \tau)$.

— If $\sigma$ = nonfunctional and $\tau \subseteq$ nonfunctional then $\mathrm{CUB}(\sigma, \tau) = \tau$.

— If $\sigma$ = any then, for any type $\tau$, $\mathrm{CUB}(\sigma, \tau) = \tau$.
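As an illustration only, the compatibility and CUB definitions above can be sketched in Python for a small fragment: atomic types, collections, and struct/constructor types (class types, nonfunctional and function types are left out). The tuple encoding, the helper names and the exact shape of the two hierarchy trees are assumptions made for this sketch and are not taken from the paper. For record types the sketch keeps struct whenever either side is a struct, following the set(q1, q2) example in the text rather than the literal constr1 of the definition.

```python
# Assumed hierarchy trees (child -> parent); the paper names the nodes
# but the exact tree shape used here is a guess for illustration.
ATOMIC_PARENT = {
    "orderable": "atomic", "bool": "atomic",
    "int/float": "orderable", "char": "orderable", "string": "orderable",
    "int": "int/float", "float": "int/float",
}
COLL_PARENT = {
    "set/bag": "collection", "list/array": "collection",
    "set": "set/bag", "bag": "set/bag",
    "list": "list/array", "array": "list/array",
}


def is_under(child, ancestor, parent):
    """True if `child` equals `ancestor` or lies below it in the tree."""
    while child is not None:
        if child == ancestor:
            return True
        child = parent.get(child)
    return False


def related(a, b, parent):
    return is_under(a, b, parent) or is_under(b, a, parent)


def coll(kind, elem):
    return ("coll", kind, elem)


def rec(constr, **members):
    return ("rec", constr, members)


def compatible(s, t):
    if s == t or "any" in (s, t):
        return True
    if isinstance(s, str) and isinstance(t, str):
        return related(s, t, ATOMIC_PARENT) or {s, t} == {"int", "float"}
    if isinstance(s, tuple) and isinstance(t, tuple) and s[0] == t[0]:
        if s[0] == "coll":
            return related(s[1], t[1], COLL_PARENT) and compatible(s[2], t[2])
        if s[0] == "rec":
            common = s[2].keys() & t[2].keys()
            return all(compatible(s[2][l], t[2][l]) for l in common)
    return False


def cub(s, t):
    if not compatible(s, t):
        raise TypeError(f"incompatible types: {s!r} and {t!r}")
    if s == t or s == "any":
        return t
    if t == "any":
        return s
    if isinstance(s, str):  # both atomic at this point
        if "int/float" in (s, t) or {s, t} == {"int", "float"}:
            return "int/float"
        return s if is_under(s, t, ATOMIC_PARENT) else t  # GLB on a chain
    if s[0] == "coll":
        kind = s[1] if is_under(s[1], t[1], COLL_PARENT) else t[1]
        return coll(kind, cub(s[2], t[2]))
    # record types: most specific constructor, common labels only
    constr = "struct" if "struct" in (s[1], t[1]) else "constructor"
    common = s[2].keys() & t[2].keys()
    return rec(constr, **{l: cub(s[2][l], t[2][l]) for l in common})


it_q1 = rec("struct", x="int", y="int")
it_q2 = rec("constructor", x="int")
print(cub(it_q1, it_q2))
```

Under these assumptions the two inferred types from the earlier example combine to a struct with the single member x of type int, which is the member type the text argues for.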

As discussed earlier, the notion of CUB is used for the inference of the
types of queries like set(q1, ..., qn). However, the OQL construct q1 in q2
raises another issue of a slightly different nature. The Standard [§4.10.8.3] states
that if the type of q2 is coll($\tau$) then the type of q1 should be $\tau$. This is
not the case in our context. Suppose that the type of q2 is inferred to be
bag(struct(x: int, y: string)); then according to the Standard q1 should have
the type struct(x: int, y: string). Since a value of type struct(x: int) could
potentially be added to the collection q2 (that is, since struct(x: int) and
struct(x: int, y: string) are compatible types), there is no reason why q1 could
not be of type struct(x: int) or even struct().

The same situation occurs when dealing with a collection of objects of different
classes. Suppose lub_class($l_1: \sigma_1, \ldots, l_n: \sigma_n$) is the LUB of all classes of
the objects in the collection and Object is the most general class that all other
classes derive from (the top of the class hierarchy). We should be able to check
whether an object of type object_class($l'_1: \sigma'_1, \ldots, l'_m: \sigma'_m$) is a member of the
collection, even if its class is not a subclass of lub_class($l_1: \sigma_1, \ldots, l_n: \sigma_n$). This
allows more queries to (safely) type-check, for example:

select x
from People as x
where x.father in School_Teachers

does not type-check according to the Standard. In order to be type-correct,
x.father requires an explicit type cast, i.e. (School_Teacher)x.father. The
problem is that this query, despite being well-typed, can generate a run-time
error (if the cast does not succeed). We choose not to enforce that
x.father has a type which is more specific than the member type of the collection
School_Teachers. Rather, we simply ensure that x.father could potentially
be a member of School_Teachers. To do this we add the constraint
Equality_Compatible($\sigma$, $\tau$), where $\sigma$ is the type of x.father and $\tau$ is the member
type of School_Teachers.



5 Resolving Constraints 

Now that we have studied the kinds of constraints that are generated by our 
inference algorithm, we can discuss how these constraints are resolved. When a 
constraint is generated by an inference rule, it is added to the set of existing 
constraints. If this was a simple insertion procedure, we would end up having a 
huge set of constraints, that would include redundant and often incomprehensible 
type information; there is obviously a need to resolve the inserted constraints. 
Due to the complexity of the type system and the expressiveness of the language,
we have a wide variety of constraints that cannot be solved using a standard
unification mechanism alone. In our system, the insertion of a new constraint
into a set of existing constraints may have one of the following effects:

— A constraint is deleted, if it is always satisfied, e.g. the constraint
Equality_Compatible(set(int), set(float))
is always true, so it does not need to be maintained.

— A constraint raises a type error or exception, if it is never satisfied, e.g.
set(Employee(name: string)) $\subseteq$ list/array(Employee(name: string)).

— A constraint might be maintained as it is. This usually occurs when some of
the types involved are general; it may be that, when refined, these types no
longer satisfy the constraint. Therefore, they must be preserved as required
schema information. For instance, if $\tau_i \subseteq$ set/bag($\tau$), $\forall i = 0, 1, 2$, are in the
set of already produced constraints, the constraint $\tau_0$ = Merge_Result($\tau_1$, $\tau_2$)
needs to be preserved.

— A constraint is often simplified, i.e. replaced by one or more simpler constraints.
For instance, the constraint set($\sigma$) = set($\tau$) is replaced by the
simpler one $\sigma = \tau$.

— A constraint occasionally implies one or more constraints. The latter need
to be added to the set of constraints already produced. For example, the
constraint Greater_Less_Than_Compatible($\tau_1$, $\tau_2$) is inserted as a new constraint
along with the implied constraints $\tau_1 \subseteq$ orderable and $\tau_2 \subseteq$ orderable.

The effect varies depending on the constraint kind, the types involved in the 
constraint and the already existing constraints on these types. The details of the 
constraint resolution algorithm will appear in [...]. It is worth pointing out
that the resolution of constraints could take place
either at the time each constraint is generated (gradual resolution) or at the time
all the constraints have been produced (accumulative resolution).

The gradual resolution is very simple, since it usually concerns the insertion
of a few constraints whose simplification (unification) is straightforward. If their
simplification produces new constraints then these are simplified as well, until
no more constraints are produced.

The accumulative resolution starts from the constraints of the form $\tau_1 = \tau_2$.
It simplifies them to constraints of the form type_var = $\tau$ and replaces type_var
by $\tau$ in all other constraints that involve type_var. Then it proceeds to simplify
all other kinds of constraints. If a simplification leads to more constraints, the
latter are added to the set of unprocessed constraints and are simplified in due
course.
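The first phase of the accumulative resolution can be sketched as follows. This is a minimal illustration under assumed encodings, not the resolution algorithm of the paper: type variables are strings starting with "?", constructed types are tuples such as ("set", elem), and constraints are triples (kind, left, right) with the hypothetical kind names "eq" and "sub". Only equality constraints are solved; all other kinds are carried along with the substitutions applied.

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")


def substitute(t, var, repl):
    """Replace every occurrence of type variable `var` inside `t` by `repl`."""
    if t == var:
        return repl
    if isinstance(t, tuple):
        return (t[0],) + tuple(substitute(a, var, repl) for a in t[1:])
    return t


def occurs(var, t):
    if t == var:
        return True
    return isinstance(t, tuple) and any(occurs(var, a) for a in t[1:])


def resolve_equalities(constraints):
    """Solve the equality constraints first and push the resulting
    bindings into every other constraint. Returns (bindings, remaining)."""
    pending = [c for c in constraints if c[0] == "eq"]
    others = [c for c in constraints if c[0] != "eq"]
    bindings = {}
    while pending:
        _, a, b = pending.pop()
        if a == b:
            continue  # always satisfied, so the constraint is deleted
        if is_var(b) and not is_var(a):
            a, b = b, a
        if is_var(a):
            if occurs(a, b):
                raise TypeError(f"recursive type: {a} occurs in {b}")
            # constraint has the form type_var = tau: replace type_var everywhere
            pending = [(k, substitute(l, a, b), substitute(r, a, b))
                       for k, l, r in pending]
            others = [(k, substitute(l, a, b), substitute(r, a, b))
                      for k, l, r in others]
            bindings = {v: substitute(t, a, b) for v, t in bindings.items()}
            bindings[a] = b
        elif (isinstance(a, tuple) and isinstance(b, tuple)
              and a[0] == b[0] and len(a) == len(b)):
            # e.g. set(sigma) = set(tau) is replaced by sigma = tau
            pending.extend(("eq", x, y) for x, y in zip(a[1:], b[1:]))
        else:
            raise TypeError(f"cannot unify {a!r} with {b!r}")
    return bindings, others


bindings, remaining = resolve_equalities([
    ("eq", ("set", "?a"), ("set", "int")),
    ("sub", "?a", "orderable"),
])
print(bindings, remaining)
```

In this run the constraint set(?a) = set(int) is simplified to ?a = int, and the remaining subtype constraint is rewritten to mention int instead of ?a.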

6 Type Inference Rules 

Having explained the type system underlying our inference model and the various 
constraints generated and resolved by our algorithm, we are now in a position 
to present the backbone of our work, the inference rules. Note that there is a 
single rule for each OQL construct, and, therefore, the use of the rules by the 
inference algorithm is syntax driven. In the remainder of this section, we present 
a substantial part of the inference rules; the complete set will be given in CH. 

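In the following rules, $\mathcal{H}$ signifies the type environment, that is $\mathcal{H} = \{\mathrm{var}_i : \tau_i\}$,
and $\mathcal{C}$ denotes the constraints added so far. The inference rules for the literal
and the identifier queries are given first:

\[
\mathcal{H};\mathcal{C} \vdash b : \mathrm{bool} \Rightarrow \mathcal{C}
\qquad
\mathcal{H};\mathcal{C} \vdash i : \mathrm{int} \Rightarrow \mathcal{C}
\qquad
\mathcal{H};\mathcal{C} \vdash f : \mathrm{float} \Rightarrow \mathcal{C}
\]

\[
\mathcal{H};\mathcal{C} \vdash c : \mathrm{char} \Rightarrow \mathcal{C}
\qquad
\mathcal{H};\mathcal{C} \vdash s : \mathrm{string} \Rightarrow \mathcal{C}
\qquad
\mathcal{H} \cup \{x : \sigma\};\mathcal{C} \vdash x : \sigma \Rightarrow \mathcal{C}
\]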

There are several rules to deal with various collections (sets, bags, lists, arrays).
We just give one representative rule, that concerns the query construct
set(q1, ..., qn). As expected, the rule generates a constraint that ensures that
the types of the queries are compatible equality-wise (or membership-wise). We
also give the rules for accessing the first, last or i-th member of an ordered collection,
as well as checking whether an element belongs to a certain collection.
The constraint Member_Type($\sigma$) = $\tau$ denotes that $\sigma$ is a collection (set, bag, list
or array) with members of type $\tau$.

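\[
\frac{\mathcal{H};\mathcal{C} \vdash q_1 : \sigma_1 \Rightarrow \mathcal{C}_1 \quad \ldots \quad \mathcal{H};\mathcal{C}_{n-1} \vdash q_n : \sigma_n \Rightarrow \mathcal{C}_n}
     {\mathcal{H};\mathcal{C} \vdash \mathrm{set}(q_1, \ldots, q_n) : \mathrm{set}(\mathrm{CUB}(\sigma_1, \ldots, \sigma_n)) \Rightarrow \mathcal{C}_n \wedge \{\mathrm{Equality\_Compatible}(\sigma_1, \ldots, \sigma_n)\}}
\]

\[
\frac{\mathcal{H};\mathcal{C} \vdash q_1 : \sigma \Rightarrow \mathcal{C}_1}
     {\mathcal{H};\mathcal{C} \vdash \mathrm{first}(q_1) : \phi \Rightarrow \mathcal{C}_1 \wedge \{\mathrm{Member\_Type}(\sigma) = \phi\} \wedge \{\sigma \subseteq \mathrm{list/array}(\phi)\}}
\]

\[
\frac{\mathcal{H};\mathcal{C} \vdash q_1 : \sigma \Rightarrow \mathcal{C}_1 \quad \mathcal{H};\mathcal{C}_1 \vdash q_2 : \tau \Rightarrow \mathcal{C}_2}
     {\mathcal{H};\mathcal{C} \vdash q_1[q_2] : \phi \Rightarrow \mathcal{C}_2 \wedge \{\tau = \mathrm{int}\} \wedge \{\mathrm{Member\_Type}(\sigma) = \phi\} \wedge \{\sigma \subseteq \mathrm{list/array}(\phi)\}}
\]

\[
\frac{\mathcal{H};\mathcal{C} \vdash q_1 : \sigma \Rightarrow \mathcal{C}_1 \quad \mathcal{H};\mathcal{C}_1 \vdash q_2 : \tau \Rightarrow \mathcal{C}_2}
     {\mathcal{H};\mathcal{C} \vdash q_1 \ \mathrm{in}\ q_2 : \mathrm{bool} \Rightarrow \mathcal{C}_2 \wedge \{\mathrm{Equality\_Compatible}(\sigma, \mathrm{Member\_Type}(\tau))\}}
\]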

The rules for constructing a structure or an object, as well as for accessing
a member of a structure or an object, are given below. The constraint
Constructor_Member_Type($\sigma$, $l$) = $\tau$ denotes that type $\sigma$ is a class or a structure
with a member called $l$ of type $\tau$.

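\[
\frac{\mathcal{H};\mathcal{C} \vdash q_1 : \sigma_1 \Rightarrow \mathcal{C}_1 \quad \ldots \quad \mathcal{H};\mathcal{C}_{n-1} \vdash q_n : \sigma_n \Rightarrow \mathcal{C}_n}
     {\begin{array}{l}
      \mathcal{H};\mathcal{C} \vdash \mathrm{class\_name}(l_1 : q_1, \ldots, l_n : q_n) : \mathrm{class\_name} \Rightarrow \mathcal{C}_n \wedge {} \\
      \quad \{\sigma_1 \subseteq \mathrm{Constructor\_Member\_Type}(\mathrm{class\_name}, l_1)\} \wedge \ldots \wedge
      \{\sigma_n \subseteq \mathrm{Constructor\_Member\_Type}(\mathrm{class\_name}, l_n)\}
      \end{array}}
\]

\[
\frac{\mathcal{H};\mathcal{C} \vdash q_1 : \sigma_1 \Rightarrow \mathcal{C}_1 \quad \ldots \quad \mathcal{H};\mathcal{C}_{n-1} \vdash q_n : \sigma_n \Rightarrow \mathcal{C}_n}
     {\mathcal{H};\mathcal{C} \vdash \mathrm{struct}(l_1 : q_1, \ldots, l_n : q_n) : \mathrm{struct}(l_1 : \sigma_1, \ldots, l_n : \sigma_n) \Rightarrow \mathcal{C}_n}
\]

\[
\frac{\mathcal{H};\mathcal{C} \vdash q_1 : \tau \Rightarrow \mathcal{C}_1}
     {\mathcal{H};\mathcal{C} \vdash q_1.l : \sigma \Rightarrow \mathcal{C}_1 \wedge \{\mathrm{Constructor\_Member\_Type}(\tau, l) = \sigma\}}
\]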

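The inference rules for the existential and the universal quantification follow.
It is worth noting that the variable x is bound in query q2 to the member type
of the collection q1.

\[
\frac{\mathcal{H};\mathcal{C} \vdash q_1 : \sigma_1 \Rightarrow \mathcal{C}_1 \quad \mathcal{H} \cup \{x : \phi\};\mathcal{C}_1 \vdash q_2 : \sigma_2 \Rightarrow \mathcal{C}_2}
     {\mathcal{H};\mathcal{C} \vdash \mathrm{exists}\ x\ \mathrm{in}\ q_1 : q_2 : \mathrm{bool} \Rightarrow \mathcal{C}_2 \wedge \{\sigma_2 = \mathrm{bool}\} \wedge \{\mathrm{Member\_Type}(\sigma_1) = \phi\}}
\]

\[
\frac{\mathcal{H};\mathcal{C} \vdash q_1 : \sigma_1 \Rightarrow \mathcal{C}_1 \quad \mathcal{H} \cup \{x : \phi\};\mathcal{C}_1 \vdash q_2 : \sigma_2 \Rightarrow \mathcal{C}_2}
     {\mathcal{H};\mathcal{C} \vdash \mathrm{forall}\ x\ \mathrm{in}\ q_1 : q_2 : \mathrm{bool} \Rightarrow \mathcal{C}_2 \wedge \{\sigma_2 = \mathrm{bool}\} \wedge \{\mathrm{Member\_Type}(\sigma_1) = \phi\}}
\]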

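An interesting set of rules concerns the application of methods with or without
parameters. The inferred types of the queries used as arguments are not
constrained to be the same as the types of the parameters of the method involved.
They only need to be their subtypes.

\[
\frac{\mathcal{H};\mathcal{C} \vdash q_1 : \sigma \Rightarrow \mathcal{C}_1}
     {\mathcal{H};\mathcal{C} \vdash q_1() : \phi \Rightarrow \mathcal{C}_1 \wedge \{\sigma = \mathrm{unit} \rightarrow \phi\}}
\]

\[
\frac{\mathcal{H};\mathcal{C} \vdash q_0 : \sigma_0 \Rightarrow \mathcal{C}_0 \quad \ldots \quad \mathcal{H};\mathcal{C}_{n-1} \vdash q_n : \sigma_n \Rightarrow \mathcal{C}_n}
     {\mathcal{H};\mathcal{C} \vdash q_0(q_1, \ldots, q_n) : \phi \Rightarrow \mathcal{C}_n \wedge \{\sigma_0 = \tau_1 \times \ldots \times \tau_n \rightarrow \phi\} \wedge \{\sigma_1 \subseteq \tau_1\} \wedge \ldots \wedge \{\sigma_n \subseteq \tau_n\}}
\]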

The rules concerning the query constructs q binop q or unop q are omitted 
for space reasons. We finish by giving the rule for a simple select query; the 
judgements dealing with a group by or an order by clause can be found in HH. 

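\[
\frac{\begin{array}{l}
      \mathcal{H};\mathcal{C} \vdash q_1 : \sigma_1 \Rightarrow \mathcal{C}_1 \\
      \mathcal{H} \cup \{x_1 : \tau_1\};\mathcal{C}_1 \vdash q_2 : \sigma_2 \Rightarrow \mathcal{C}_2 \\
      \quad \vdots \\
      \mathcal{H} \cup \{x_1 : \tau_1, \ldots, x_{n-1} : \tau_{n-1}\};\mathcal{C}_{n-1} \vdash q_n : \sigma_n \Rightarrow \mathcal{C}_n \\
      \mathcal{H} \cup \{x_1 : \tau_1, \ldots, x_n : \tau_n\};\mathcal{C}_n \vdash q_{01} : \sigma_{01} \Rightarrow \mathcal{C}_{01} \\
      \mathcal{H} \cup \{x_1 : \tau_1, \ldots, x_n : \tau_n\};\mathcal{C}_{01} \vdash q_{00} : \sigma_{00} \Rightarrow \mathcal{C}_{00}
      \end{array}}
     {\begin{array}{l}
      \mathcal{H};\mathcal{C} \vdash \mathrm{select}\ q_{00}\ \mathrm{from}\ q_1\ \mathrm{as}\ x_1, \ldots, q_n\ \mathrm{as}\ x_n\ \mathrm{where}\ q_{01} : \mathrm{bag}(\sigma_{00}) \Rightarrow \mathcal{C}_{00} \wedge {} \\
      \quad \{\mathrm{Member\_Type}(\sigma_1) = \tau_1\} \wedge \ldots \wedge \{\mathrm{Member\_Type}(\sigma_n) = \tau_n\} \wedge \{\sigma_{01} = \mathrm{bool}\}
      \end{array}}
\]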



7 Inference Algorithm 

Having given an overview of the type system, the constraints and the rules 
involved in our inference model, we may now present the core of our work, 
which is the inference algorithm. The algorithm takes as input a query q and 
returns its inferred type, as well as a pair $(\mathcal{H}, \mathcal{C})$ of a type environment and its
constraints. This pair is a synopsis of the requirements a schema should satisfy 
so that the query q can be executed against it without any type-errors. 

1. For each free variable var in the query q, $\mathcal{H} = \mathcal{H} \cup \{\mathrm{var} : \mathrm{new\_type\_var}\}$.
Initially $\mathcal{C} = \{\}$.

2. Based on the construct of the query q recursively apply the appropriate 
inference rule. 

3. Depending on the unification strategy, either simplify the constraints as soon 
as they are produced (gradual unification) or simplify them all in the end 
after having applied all the inference rules. If the unification process produces 
a type-error then the query is not typable and the algorithm is interrupted. 

4. The final $\mathcal{H}$, $\mathcal{C}$ include the requirements a schema should have to be compatible
with the query q. The type of q, which is the type inferred by the outer
inference rule, also satisfies the constraints $\mathcal{C}$.
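The syntax-driven character of steps 1 and 2 can be illustrated with a small sketch that covers only literals, identifiers, set(...), q1 in q2 and member access. The AST encoding as tagged tuples, the class name Inferencer and the "?tN" naming of fresh type variables are assumptions of this sketch. It only generates constraints (step 3, their resolution, is not performed), and it assigns a fresh type variable to a free variable when the variable is first met rather than in a separate initial pass.

```python
import itertools


class Inferencer:
    def __init__(self):
        self._ids = itertools.count()
        self.env = {}          # type environment H
        self.constraints = []  # constraint set C, in generation order

    def fresh(self):
        return f"?t{next(self._ids)}"

    def infer(self, q):
        """Return the inferred type of query `q`, recording constraints."""
        tag = q[0]
        if tag == "lit":
            value = q[1]
            # bool is tested before int because bool is a subclass of int
            for py_type, name in ((bool, "bool"), (int, "int"),
                                  (float, "float"), (str, "string")):
                if isinstance(value, py_type):
                    return name
            raise TypeError(f"unsupported literal: {value!r}")
        if tag == "var":
            name = q[1]
            if name not in self.env:
                self.env[name] = self.fresh()
            return self.env[name]
        if tag == "set":
            types = [self.infer(e) for e in q[1]]
            self.constraints.append(("Equality_Compatible", *types))
            return ("set", ("CUB", *types))
        if tag == "in":
            sigma = self.infer(q[1])
            tau = self.infer(q[2])
            self.constraints.append(
                ("Equality_Compatible", sigma, ("Member_Type", tau)))
            return "bool"
        if tag == "field":
            tau = self.infer(q[1])
            sigma = self.fresh()
            self.constraints.append(
                ("Constructor_Member_Type", tau, q[2], sigma))
            return sigma
        raise ValueError(f"unknown construct: {tag}")


# where-clause of the earlier example: x.father in School_Teachers
query = ("in", ("field", ("var", "x"), "father"), ("var", "School_Teachers"))
inf = Inferencer()
print(inf.infer(query))
print(inf.env)
print(inf.constraints)
```

The resulting environment and constraint list play the role of the pair of type environment and constraints returned by the algorithm: they describe what a schema must provide for x and School_Teachers so that the membership test is type-correct.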

8 Related Work and Conclusions 

Fundamental to the work described in this report is the type system for ODMG
OQL described in an earlier paper. Alagic [2] independently gave a number
of typing rules for OQL; see our earlier paper for a comparison. The canonical
reference for work on type systems for database programming languages is the
work of Buneman and Ohori. The goals of their type inference algorithm are
identical to ours; both approaches infer the most general type of an expression 
(if one exists) without accessing any schema information, and in this sense deter- 
mine the constraints placed on the schema by the query. However, the underlying 
languages, the type systems, and some parts of the inference algorithms differ 
considerably. Buneman and Ohori introduce kinded types to infer the type of a 
record based on selections of fields on this record. Instead, we use the notion of 
general types; in this way we are able to express general type information not 
only for records, but also for parametric collection types, structures and classes. 
Moreover, due to the syntax of OQL, we define a wider variety of constraints 
than those introduced in their framework and therefore a different algorithm to 
resolve them. 

The work most related to ours arises from studying type systems for object-oriented
programming languages. However, none of
these studies considers the various issues arising from studying database type
systems; for example, the complications arising from combining parametric collection
types with subtyping.

Much research on schema evolution, schema inter-operation, distributed or 
semi-structured database applications has pointed out that there is a need to 
run queries in the presence of changing or heterogeneous schemata, or even in 
the absence of specific schema information. Our work addresses this problem, 
by proposing an inference algorithm for the ODMG query language OQL. This 
algorithm infers the most general type of an OQL query and derives the schema 
information required so that the query can be executed against it without any 
type errors. In contrast to other work, we deal with a rather complex type 
system, which includes atomic types, structures, classes, various (parameterised) 
collection types (set, bag, list, array) and function types. This, in connection 
with the rich semantics of OQL, results in the generation of a wide variety of 
constraints by the inference rules. We discuss the semantics of these constraints 
and provide a mechanism for their solution. Finally, we present a set of inference 
rules for OQL, which is the core of our type inference algorithm. Based on our 
experience, this algorithm, as well as all the formalisms prior to it, are easy to 
implement, and hence, we believe that they could prove to be useful in many 
applications.

Abstract. This paper presents a flexible system, DIVE-ON, for the purpose
of visual data mining. A new approach to interactively visualize and
explore N-dimensional data warehouses in an immersed virtual environ- 
ment is put forth. DIVE-ON is capable of constructing a multidimen- 
sional data model on a remote system, transporting pertinent views to 
a CAVE, creating an immersed virtual environment and providing an 
interactive data mining toolset. DIVE-ON architecture emphasizes the 
development of two independent subsystems, a visualization environment 
and a virtual data warehouse. The first objective of our research is to 
examine the possibility of effective mining, and manipulating views with 
little or no instructional help by providing an environment that is built 
around the human’s visual, sensorimotor, and spatial knowledge acquisi- 
tion abilities. The second goal is to create a highly transparent and cen- 
tralized data warehouse that integrates various distributed data sources. 
Within the warehouse, DIVE-ON incorporates an XML-based multidi- 
mensional query language (XMDQL) to circulate the queries among the 
distributed data sources. 



1 Introduction 

Data mining and data visualization have become two essential tools in data 
analysis. While data mining relies on algorithms, structures and operations for 
data extraction, visualization techniques rely on advances in computer graphics 
and the human visual system to convey this extracted knowledge. Datamining 
in an Immersed Virtual Environment Over a Network, DIVE-ON, is the name 
of a system that utilizes advances in virtual reality, databases, and distributed 
computing to develop a new approach to visual data mining through the use 
of Virtual Reality (VR). The system’s high modularity ensures that deploying
various visualization technologies and various data warehouse architectures can 
be achieved independently of one another. 

At an amazing rate, corporations worldwide are mining their data to learn 
more in many areas including fraud, client purchasing patterns, fleet utiliza- 
tion, credit applications and health care outcome analysis. Significant research 
efforts have been devoted towards facilitating the ability to access pertinent in- 
formation, which may be hidden beneath massive data volumes that have been 
collecting in distributed repositories for many years. Data warehouses have been
developed to aid in our quest for a framework that is well suited for the purpose
of Knowledge Discovery in Databases (KDD). Data warehouses incorporate a
multidimensional data model that is designed for the use of knowledge workers
(upper management and analysts) for online analytic processing (OLAP)
so that historical data can be quickly presented in various views and degrees of
abstraction.

Although considerable effort has been devoted
to the research and development of data mining algorithms, distributed DBMS,
VR, and data visualization, the amalgamation of all these advances into one
body is rather new. Due to very restrictive limitations in hardware technology,
multidimensional data visualization applications have been put to
commercial use only since the early 1990s. DIVE-ON is our attempt to advance
this process a step further by combining advances in VR, computer graphics,
and data mining into a flexible system that can be used effectively with little
or no training. This paper is organized into three main sections. The complete
system end-to-end is presented first, then the concept of immersion in a virtual
environment is presented along with the state-of-the-art IVE system, the CAVE
theatre (Section 3). The fourth section presents the architecture of the Virtual
Data Warehouse (VDW) and illustrates how the XML-based (XMDQL) queries
are formed, parsed, and distributed among the various data sources.

Fig. 1. The three components of the DIVE-ON system.



2 Complete System Overview 

DIVE-ON can be abstracted in terms of three task-specific subsystems, which 
are tightly coupled to provide the services required. Figure 1 shows various layers
composing the complete system from the server-side (DBMS) to the client-side 
(CAVE). The first subsystem is the Virtual Data Warehouse (VDW) (details 
in Figure |H|) which is responsible for creating and managing the data ware- 
house over the distributed DBMS. A main component of the virtual warehouse 







A. Ammoura, O.R. Zai'ane, and Y. Ji 



is a shell, the Data Cube Constructor Shell (DCC-Shell), that fulfills incoming data
transportation/query messages (Figure 1 (1) and (2)). The second subsystem
is the Visualization Control Unit (VCU), which is responsible for the creation
and handling of the virtual world in a manner that maximizes the frame rate
to ensure that the “reality” in virtual reality is not compromised (Figure 1 (3),
details in Figure El). The ability to provide interactive virtual data mining tools
is the task of the third subsystem, the User Interface Manager (UIM) (Figure
1 (4), details in Figure 5). Inter- and intra-subsystem data exchange is handled
via a set of specialized interfaces that implement specific protocols to guarantee
reliability, extensibility and subsystem independence. This communication
is provided through the implementation of client and server applications by using
both Common Object Request Broker Architecture (CORBA) over TCP/IP
and Simple Object Access Protocol (SOAP) over HTTP. In later stages
of our research, we will evaluate and compare the two implementations. Messages
between subsystems are transmitted as XML documents, which contain
the requests and the corresponding responses (Section 4). In terms of location,
the VCU and the UIM exist locally in the graphics research facilities at the
University of Alberta, while the VDW can be distributed and remote.

3 Immersed Iconographic Visualization 

Iconographic data visualization is a technique that employs icons or glyphs, 
whose visual attributes are bound to the data being examined. Many researchers 
have experimented with the idea of using highly detailed icons or glyphs to 
represent a direct mapping between numerical and visual measures. Such
applications place great emphasis on the quantity of a measure and how it can 
be represented as accurately as possible. Other types of visualization techniques 
aim to produce a collective effect such as gradients or islands of contrasting 
textures, which correspond to structures in the data. In contrast, DIVE-ON
creates a visualization environment on a conceptual level. The Immersed Virtual 
Environment (IVE) is not designed to tell the user that the total sale of a branch 
was an X amount of dollars for example; it is designed to convey the significance 
of this amount. 



3.1 The CAVE Environment 

Fig. 2. A CAVE user within the three back-projected walls. T1, T2: The head and
hand-held tracker data stream respectively (real-time).

While the gathering and building of information can be done from any location,
the actual visualization experience takes advantage of the state-of-the-art virtual
reality environment at the University of Alberta in Canada. This facility,
called the VizRoom, is formally known as the CAVE Theatre (CAVE and VizRoom
are used interchangeably). CAVE is a recursive acronym (Cave Automatic Virtual
Environment) that refers to a visualization environment that places the
user within three 9.5 × 9.5 foot walls (Figure 2). Each of these walls is back-projected
with a high-resolution projector that delivers the rendered graphics
at 120 frames per second (Figure 2). The graphics projected are in stereo (60
frames per second for each eye), which enables DIVE-ON to create stereoscopic
views that can be seen by the user by wearing lightweight shutter glasses. This
type of IVE was chosen for DIVE-ON over other systems for several reasons,
including:

— The user is free to move naturally without the constraints of the head 
mounted display (HMD). 

— As mentioned above, the DIVE-ON is essentially a decision support tool and 
most likely will be used by a group or a team of analysts. 

— Within the CAVE any view at any time is instantly available to all for examination
and discussion (Figure 3a). Within the walls of the CAVE one
is able to communicate naturally with others, for example by using a hand to
point out an interesting data item.

— It is easier to increase realism in a CAVE environment since both the left
view and the right view are already rendered (on the left and right walls).
This gives the user access to readily available views in a natural way, by
simply turning their head. With the HMD, the head orientation is tracked
(T1 in Figure 2) and used to trigger image rotation in correspondence with
the user’s head rotation.

— From previous experiments, hygienic factors were a big issue for some users. 
After all, wearing a helmet with an inside display while walking around will 
definitely make most of us sweat. 



3.2 Creating the Virtual Objects 

The data presented to the user in the IVE is encoded using graphical objects; 
these are the objects that actually make up the rendered virtual world. The 
VCU views the three-dimensional cube it receives from the VDW (Section 4) as
a three-variable function. Each of the three data dimensions becomes associated
with one of the three physical dimensions, namely X, Y, and Z. Since each entry
in the data cube is a structure containing two measures $M_1$ and $M_2$, the VCU
simply plots the two functions $M_1(x, y, z)$ and $M_2(x, y, z)$. Next we will
present the meaning of these measures.

Assume that the data warehouse under analysis is built around the theme 
“dollars sold”. An OLAP decision support person is not primarily interested 
in the fact that during the year t the total sale of product p at store s was 
$100,000.00; what is important is the context that this measure occurs within. 
The VCU expresses this context to the user in VR by associating these measures 
with visual cues that are bound to the virtual objects. The first cue we use is 
size, which is associated with the measure $M_1$ (dollars sold). After normalization,
$M_1(x_t, y_p, z_s)$ is used to render a cube (or a sphere) of that length (or radius)
centered at position $(x_t, y_p, z_s)$, for some t, p, and s within the data range.
Figure 3 illustrates this criterion. A user walking among these virtual objects in
VR becomes almost instantly aware of the relative significance of each value 
without the need for numeric data. 




Fig. 3. A team of immersed users discussing the “dollars sold” data cube. (a) Using
cubes. (b) A user pointing the direction of flight within 3D-lit spheres.



The second cue used is the object’s colour. An 8-colour palette is chosen and 
the range of normalized values of the measure $M_2$ is discretized and mapped
to the palette. The palette extremes are “red”, representing a “high” measure 
value, and “blue” representing a “low” value. As the magnitude of a measure 
increases in value so does the red content of its associated object. Using colour 
to encode data attributes has many significant applications. For example, at the 
lowest level of aggregation (high granularity) colour can be used to represent 
the deviation from the mean along one of the dimensions, profit margins and 
availability. This is particularly useful for market fluctuation, utilization and 
profitability analysis. 
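The two visual cues can be sketched as simple functions of the normalized measures. The min-max normalization, the function names and the max_size parameter are assumptions for illustration; the text only states that the measures are normalized, that size follows the first measure, and that the second measure is discretized onto an 8-colour palette running from blue (low) to red (high).

```python
def normalize(values):
    """Min-max normalization to the range [0, 1] (an assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def glyph_size(norm_m1, max_size=1.0):
    """Edge length (cube) or radius (sphere) for a normalized M1 value."""
    return norm_m1 * max_size


def palette_index(norm_m2, levels=8):
    """Discretize a normalized M2 value into `levels` palette slots.

    Slot 0 stands for the "blue" (low) extreme and slot levels - 1 for
    the "red" (high) extreme.
    """
    return min(int(norm_m2 * levels), levels - 1)


m1 = normalize([10, 20, 30])
m2 = normalize([0.1, 0.4, 0.9])
for size, colour in zip(m1, m2):
    print(glyph_size(size, max_size=2.0), palette_index(colour))
```

The intermediate palette colours are not specified in the text, so the sketch returns only a slot index and leaves the mapping from slot to colour open.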

Furthermore, if the user is viewing the data in a highly summarized view, 
the colour can become a very effective tool that helps the user locate anomalies 
at a lower level. For example, the $M_2$ value for a month object can represent
the maximum $M_2$ found in all the days it aggregates. In Figure 3, a virtual
object representing the total revenue for a given year with the colour “red”
could indicate that one of the months has a great deviation from the rest of the
year. This will entice the analyst to reach in virtual reality and “select” that
object to understand the reason and inquire about the exact figures for that
year. Alternatively, the user may be interested in knowing that a great stability
dominates a product category (a “blue” object).

Figure 3 presents the IVE created by rendering cubes that embody the above-described
use of visual cues. The three dimensions X, Y, and Z are mapped to
the data cube dimensions Product, Supplier and Location respectively. Colour is
associated, in this example, with the profit margin while size with gross revenue.
The user may request the data attributes to be encoded using cubes (Figure
4a) or spheres (Figure 4b); while rendering cubes is much faster, using spheres
reveals more information using the same virtual space.





Fig. 4. Same data attributes from the same viewpoint using cubes and spheres: (a)
Aggregates as cubes. (b) The arrows point to some of the data objects that are occluded
in (a).



Using spheres makes it easier to distinguish between objects that are within
close proximity to one another. To illustrate, Figure 4 shows two snapshots of
a small-scale version of DIVE-ON running on a PC. Both images represent the
same data, from the same viewpoint, and using the same distance between object
centres. The arrows in Figure 4b point at a few of the objects that were
previously invisible in Figure 4a. It is apparent that using spheres “exposes”
more data items within the same display area. Spheres are indeed a better
presentation for the data for two main reasons. First, to distinguish a sphere
from a “filled circle”, the graphics engine must perform shading and lighting
calculations for the rendered surface. The variation of the shading intensity
on the surface of a sphere is a significant visual cue that helps the user
locate where one object ends and another begins. Second, a sphere occludes
fewer objects simply because it occupies less volume. Suppose that we
represent an aggregate with size attribute r using a cube of side length r and
a sphere of diameter r. Since the volume of the sphere is approximately
$0.476r^3$ less than that of the cube, using a sphere instead of a cube clears
that amount of volume for the line of sight to pass through.
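The 0.476 figure follows directly from the two volume formulas. The short derivation below uses only the cube side length and sphere diameter r defined above:

```latex
\[
V_{\text{cube}} - V_{\text{sphere}}
  = r^{3} - \frac{4}{3}\pi\left(\frac{r}{2}\right)^{3}
  = \left(1 - \frac{\pi}{6}\right) r^{3}
  \approx 0.476\, r^{3}
\]
```

In other words, a sphere frees a little under half of the volume that a cube encoding the same value would occupy.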

3.3 The User Interface Manager (UIM) 

To support various VR devices, tools, and environments it was necessary to 
separate the creation and the application of the virtual world. The User Interface 
Manager (UIM) is the subsystem that is responsible for the reception, filtration, 
formulation, and channeling of all available input streams. All tracker signals
feed into the UIM for preprocessing (Figure 5, Tracker Interface (TI)). Constant
update of the user’s location is necessary to determine the initial location of
the floating menu (Figure Oa), the active menu number, and the current menu
choices. This information is used by the VCU to immerse the user among the
graphical objects by providing him/her with the sense of presence. To create
such an environment, the body motion data collected by the UIM (TI in Figure 5)
is used to translate the stereo graphics in a manner that simulates the real
world. For example, if the user walks forward, the appropriate image translation
would be backwards to create the illusion of “walking” through the data. The
data stream emitted from the user’s hand (T2 in Figure 5) is used to track
the position of the 3D menu in the virtual world. The menus are called
“floating menus” because their location is mapped, with six degrees of freedom
(6-DOF), to the user’s hand.

Fig. 5. User Interface Manager. The Tracker Interface (TI) receives the real-time
tracker input stream and channels it according to type. The set of interaction
parameters is then fed to the VCU.
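The walking example in Section 3.3 can be illustrated with a minimal sketch. It is not DIVE-ON code: the `Vec3` type, the function names, and the choice of +z as "forward" are all assumptions made for illustration, and orientation (the remaining three of the six degrees of freedom) is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vec3:
    """Hypothetical 3D position as reported by a tracker."""
    x: float
    y: float
    z: float

    def __sub__(self, other: "Vec3") -> "Vec3":
        return Vec3(self.x - other.x, self.y - other.y, self.z - other.z)

    def __neg__(self) -> "Vec3":
        return Vec3(-self.x, -self.y, -self.z)


def scene_translation(previous_head: Vec3, current_head: Vec3) -> Vec3:
    """Offset to apply to the rendered scene for one head-tracker update."""
    user_motion = current_head - previous_head
    return -user_motion  # the world moves opposite to the user


def menu_position(hand: Vec3) -> Vec3:
    """The floating menu follows the tracked hand position (orientation omitted)."""
    return hand


# The user takes one unit step forward along +z (assumed forward axis).
step = scene_translation(Vec3(0.0, 0.0, 0.0), Vec3(0.0, 0.0, 1.0))
```

Here `step` moves the scene one unit along -z, which is the backwards image translation the text describes.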



In almost every IVE application the essence of computer-human interaction
can be categorized as object manipulation, viewpoint manipulation, or
application control. The reason for this taxonomy is that simulating reality
is all about simulating the changing views around us along with the ability to
interact with the objects that make up these views. As a result, an interface
that is highly transparent must use the human sensorimotor system to its
advantage by providing the proper visual feedback to the user’s motion.
Understanding these issues helped us provide an interface that requires little
or no instruction; however, it should be clear that the user is required to
have an understanding of the data source and OLAP operations. All the
interactive capabilities of DIVE-ON have been grouped in adherence with the
above categorization. The users chosen in our experiments were people who are
familiar with the architecture of the data warehouse being viewed and with the
terminology and methodologies of data mining.



3.4 The Visualization Control Unit (VCU) 

The VCU is the module responsible for generating and managing the Immersed
Virtual Environment (IVE) for data visualization and exploration and should
be viewed only as such. This means that the specifics of the VDW should be
of no concern to the VCU developer and vice versa. To implement this abstract
view, the VCU and the VDW are each constructed within a wrapper that provides
the only means of relaying messages between the two subsystems. A simple
communication protocol that defines a set of requests (VCU to DCC) and their
corresponding replies (DCC to VCU) is implemented in DIVE-ON. This design
effectively hides the implementation details and allows the VCU and the DCC
to be independent of one another. After the DCC completes the creation of the
N-dimensional data cube, it signals the VCU (via the DCC-Shell). Since we are
generating a 3D virtual world, only a subset of the available dimensions can be
viewed at any given time (up to 7 dimensions: 3 physical dimensions in addition
to colour, size, animation and possibly sound). The dimensions chosen by the
user are extracted from the N-dimensional data cube and a sub-data cube is
passed to the VCU for rendering. In light of the above discussion, what level
of abstraction does the sub-cube represent? In other words, is the sub-cube
highly summarized or highly abstracted? Since the user will be placed within
an IVE, it is imperative that the delay in responding to the user’s actions is
kept to a minimum. For this reason, DIVE-ON relies on the VCU to perform the
required data aggregation (generating less detailed data). This effectively
reduces network dependence to a minimum.
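To make the idea of aggregating inside the VCU concrete, here is a minimal sketch of a roll-up performed locally on an already-received sub-cube, so that no further request to the VDW is needed. The cell layout (a dictionary keyed by coordinate tuples), the function name `roll_up`, and the sample products and cities are assumptions for illustration; the paper does not describe the VCU's internal data structures.

```python
from collections import defaultdict


def roll_up(cells, axis, parent_of):
    """Aggregate cube cells one level up along one axis, using SUM.

    cells     -- {(coord_0, coord_1, ...): measure value}
    axis      -- index of the dimension to roll up
    parent_of -- maps each member on that axis to its parent in the hierarchy
    """
    totals = defaultdict(float)
    for coords, value in cells.items():
        lifted = list(coords)
        lifted[axis] = parent_of[coords[axis]]
        totals[tuple(lifted)] += value
    return dict(totals)


# Hypothetical (Product, Supplier, Location) sub-cube held by the VCU.
cells = {
    ("Fax", "Acme", "Edmonton"): 10.0,
    ("Fax", "Acme", "Calgary"): 5.0,
    ("Copier", "Acme", "Edmonton"): 7.0,
}
city_to_province = {"Edmonton": "Alberta", "Calgary": "Alberta"}

by_province = roll_up(cells, axis=2, parent_of=city_to_province)
```

Because the detailed cells are already on the client, a change of granularity costs only this local pass rather than a network round trip.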

As illustrated in Figure 6, the VCU contains a module called “VR Partition.”
This module provides DIVE-ON with the scalability needed when handling massive
amounts of data. As the number of rendered objects increases, the frame rate
of the image transformations decreases. This reduces the sense of presence,
since the latency between the user’s motion and the image update increases.
To prevent such situations, only the portions of the virtual world that are
visible to the user are actually rendered. To this end, the VCU partitions the
virtual world using a special type of octree, based on the user’s virtual
location. The details of the approach are outside the focus of this paper and
will be published separately.

Fig. 6. The VCU Architecture. SI and CI are the SOAP and CORBA client interfaces
respectively. Output is channeled to Left, Front, and Right projection stereo
signals.
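Since the authors defer the details of their "special type" of octree, the following is only a generic point octree, shown to illustrate how spatial partitioning lets a renderer skip most of the world. Everything here is an assumption: the class name, the capacity of four points per cell, and above all the simplification of "visible" to "inside an axis-aligned box of half-width `reach` around the viewer". Points are assumed to lie inside the root cell.

```python
from dataclasses import dataclass, field

Point = tuple[float, float, float]


@dataclass
class Octree:
    center: Point
    half: float                 # half the side length of this cubic cell
    capacity: int = 4           # points a leaf holds before it splits
    points: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def insert(self, p: Point) -> None:
        if self.children:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        # The size guard stops endless splitting on coincident points.
        if len(self.points) > self.capacity and self.half > 1e-6:
            self._split()

    def _split(self) -> None:
        h = self.half / 2
        cx, cy, cz = self.center
        self.children = [
            Octree((cx + dx * h, cy + dy * h, cz + dz * h), h, self.capacity)
            for dx in (-1, 1) for dy in (-1, 1) for dz in (-1, 1)
        ]
        pending, self.points = self.points, []
        for p in pending:
            self._child_for(p).insert(p)

    def _child_for(self, p: Point) -> "Octree":
        cx, cy, cz = self.center
        # Same ordering as the list built in _split: x, then y, then z.
        index = (p[0] >= cx) * 4 + (p[1] >= cy) * 2 + (p[2] >= cz)
        return self.children[index]

    def query(self, viewer: Point, reach: float) -> list:
        """Return stored points inside the box of half-width reach around viewer."""
        if any(abs(v - c) > reach + self.half
               for v, c in zip(viewer, self.center)):
            return []           # this cell cannot overlap the query box
        found = [p for p in self.points
                 if all(abs(a - b) <= reach for a, b in zip(p, viewer))]
        for child in self.children:
            found.extend(child.query(viewer, reach))
        return found


world = Octree(center=(0.0, 0.0, 0.0), half=100.0)
for x in range(-90, 100, 20):
    world.insert((float(x), 0.0, 0.0))

visible = world.query(viewer=(0.0, 0.0, 0.0), reach=25.0)
```

Whole subtrees that cannot overlap the query box are rejected with a single test, which is what keeps the per-frame cost from growing with the total number of objects.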




Fig. 7. (a) An exocentric presence within the IVE. (b) Using the hand-held tracker
to fly through the data. Once the user locates an interesting locality, they may stop
the flight mode and start walking for detailed exploration.



4 The Virtual Data Warehouse (VDW) 

The Virtual Data Warehouse (VDW) is a conceptualization of a centralized data
warehouse that includes a set of distributed data sources and a shell
(DCC-Shell) that is responsible for managing and querying these sources
(Figure 8). The DCC-Shell does not store any actual transactions (raw data).
Instead, these transactions are left on their original sites while the DCC
builds and updates a global multidimensional model of the available dimensions
and measures, hence the name “virtual warehouse”. This approach provides the
VDW clients with a constantly updated, global view of an N-dimensional data
cube in a manner that makes the source distribution transparent. Although the
DCC-Shell does not store raw data, it maintains a pool of meta-data (the cube
global schema) that is synchronized with all data sources. These meta-data are
prepared when constructing the VDW, and when a part is modified, all sites must
be updated to avoid inconsistencies. Besides meta-data, the DCC-Shell also
maintains a resource allocation table that includes information pertaining to
the location, data organization, and communication method of each data source.
All the meta-data and resource allocation data are in XML format, which is
easy to understand and maintain and makes the whole system extensible and
flexible.



Fig. 8. The Virtual Data Warehouse (VDW) architecture. (Diagram labels: Data
Source 1, DCC Shell, Data Source 2.)



Similar to a traditional warehouse, the virtual warehouse provides querying
capabilities to a requesting client. Three classes of query functions are
available to a client. The first is the Warehouse query, which provides the
client with the basic VDW structure, including the number of data cubes and
the measures and main theme of each cube. Many physically or conceptually
different data cubes can be serviced by the virtual data warehouse. The Cube
schema query provides the meta-data of one specific data cube in the VDW.
This meta-data includes a depiction of all available dimensions, measures, and
the concept hierarchies that further describe each dimension. Finally, the Cube
data query is used to obtain an entire N-dimensional cube or any subset of it.
This is particularly useful in applications such as the VCU, which handles
low-dimensional cubes for visualization sessions (usually 3 dimensions, but up
to 7). Query requests and responses are in XML format, analogous to the way
meta-data is stored within the DCC-Shell. The next section presents the relay
of XML messages within the system, followed by schema formulation and the
XMDQL language.



4.1 Inter/Intra Subsystem Communications 

All DIVE-ON internal messages communicated between the VCU and the VDW,
and between the VDW and the data sources, are formed as XML documents. The
VDW is designed as a stand-alone server capable of fulfilling incoming inquiries
from any client. As seen in Figure 0 the server software (CORBA and SOAP
servers) is built using a Java API layer, which is regarded as middleware that
interprets incoming requests and encodes outgoing replies between the server
modules and the DCC. These server modules provide their services by
constructing a set of objects that a client can invoke remotely; the client
then awaits the receipt of the XML document containing the response. Adhering
to this mechanism, the VCU handles the IVE, OLAP and user interaction by using
a client module to invoke the DCC-Shell object responsible for the particular
service sought.

Communication within the VDW is also implemented as client-server using
CORBA and SOAP. To allow the DCC-Shell control over the distributed data
sources, each source is built within a wrapper working either as a SOAP or
a CORBA server. The DCC-Shell in turn uses a client module to access each
source. This approach provides a solution to the problem of managing
heterogeneous data sources in a manner that is transparent to the DCC. For
example, regardless of whether the back-end source is a real data warehouse,
a relational database or even a legacy system, the DCC-Shell is always
provided with a uniform interface. The DCC-Shell CORBA and SOAP server objects
(Figure El) provide the functions needed for warehouse management and query.
In a typical session a client first establishes a connection to the VDW and
then requests information pertaining to its contents. Having learned about the
existing entities within the VDW, a client may then initiate a cycle of
requests for any available cuboid.



4.2 Schema Formulation 

Since the VDW maintains a global and up-to-date schema that combines all
the distributed resources, it is important to choose a representation that
facilitates reads, updates, and modifications. For this reason, the schema is
represented using XML. Constructing an XML schema for the general structure of
the warehouse is rather simple; consequently, we illustrate our method by
discussing the data cube schema that is used to fulfill the cube schema query.

The Cube Schema is defined mainly by a set of measures and dimensions. 
Within an XML document of the VDW, the root element CubeSchema has sub- 
elements of Measures and Dimensions, which consist of Measure elements and 
Dimension elements. Each measure element has a name attribute to identify 
itself, and an aggregationFunction attribute with value “SUM” or “COUNT” 
to express how to aggregate the measure data. Measure also has several sub- 
elements such as Title, DataType and Description. Here is an example of Measure 
elements:

<Measures>
  <Measure name="Unit_Sales" aggregationFunction="SUM">
    <Title>Unit Sales</Title>
    <DataType>double</DataType>
  </Measure>
  <Measure name="Store_Cost" aggregationFunction="SUM">
    <Title>Store Cost</Title>
    <DataType>double</DataType>
  </Measure>
  <Measure name="Store_Sales" aggregationFunction="SUM">
    <Title>Store Sales</Title>
    <DataType>double</DataType>
  </Measure>
  <Measure name="Sales_Count" aggregationFunction="COUNT">
    <Title>Sales Count</Title>
    <DataType>double</DataType>
  </Measure>
</Measures>
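As a rough illustration of how a client might consume such a Measures fragment, the sketch below parses it with Python's standard `xml.etree.ElementTree` and binds each `aggregationFunction` value to a callable. The `AGGREGATORS` table, the `load_measures` helper, and the shortened sample document are assumptions for illustration; the paper does not describe how DIVE-ON itself reads the schema.

```python
import xml.etree.ElementTree as ET

# Shortened sample in the shape of the Measures fragment above.
MEASURES_XML = """
<Measures>
  <Measure name="Unit_Sales" aggregationFunction="SUM">
    <Title>Unit Sales</Title>
    <DataType>double</DataType>
  </Measure>
  <Measure name="Sales_Count" aggregationFunction="COUNT">
    <Title>Sales Count</Title>
    <DataType>double</DataType>
  </Measure>
</Measures>
"""

# Hypothetical mapping from the schema's function names to Python callables.
AGGREGATORS = {"SUM": sum, "COUNT": len}


def load_measures(xml_text):
    """Return {measure name: {title, datatype, aggregate}} from a Measures document."""
    root = ET.fromstring(xml_text)
    measures = {}
    for m in root.findall("Measure"):
        measures[m.get("name")] = {
            "title": m.findtext("Title"),
            "datatype": m.findtext("DataType"),
            "aggregate": AGGREGATORS[m.get("aggregationFunction").upper()],
        }
    return measures


measures = load_measures(MEASURES_XML)
facts = [3.0, 4.0, 5.0]
unit_sales_total = measures["Unit_Sales"]["aggregate"](facts)
sales_count = measures["Sales_Count"]["aggregate"](facts)
```

Upper-casing the attribute before the lookup makes the sketch tolerant of spellings such as "Count" as well as "COUNT".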

The concept hierarchy defined on a dimension is expressed using levels of
the ontology used. It is straightforward to write down dimension information in
XML format given that the attributes forming the concept hierarchy are related
by a total order relation (Figure 9). In some instances, however, the attributes
of a dimension may be organized in a partial order forming a lattice. For
example, attributes of the dimension “time” could be organized as a partial
order such as “day < {month < quarter; week} < year”. In such a relation,
a roll-up operation from the level “day” is ambiguous since there are two ways
to ascend the concept hierarchy, namely through “week” or through “month”. It
is hence not possible to represent the concept hierarchy as a tree, since the
partial order produces a cycle. Removing the attribute “week” produces the
totally ordered concept hierarchy (Figure 9).
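The ambiguity can be checked mechanically. The sketch below encodes the time example as a mapping from each level to the levels directly above it, reports the levels from which a roll-up has more than one target, and shows that dropping "week" leaves a chain. The function names and the dictionary encoding are assumptions for illustration only.

```python
def ambiguous_levels(parents):
    """Levels from which a roll-up has more than one possible target."""
    return sorted(level for level, ups in parents.items() if len(ups) > 1)


def remove_level(parents, level):
    """Return the hierarchy with one level dropped from every relation."""
    return {lv: ups - {level} for lv, ups in parents.items() if lv != level}


# day < {month < quarter; week} < year, as direct "rolls up to" links.
lattice = {
    "day": {"month", "week"},
    "month": {"quarter"},
    "quarter": {"year"},
    "week": {"year"},
    "year": set(),
}

chain = remove_level(lattice, "week")
```

With these definitions, `ambiguous_levels(lattice)` reports only "day", and the same check on `chain` reports nothing, so every level has a single way up.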




Fig. 9. A lattice (partial order) and a tree (total order) concept hierarchy for the
dimension time. (Diagram labels: Year, Quarter, Month, Day.)



To solve the problem of partially ordered concept hierarchies, the attributes
that cause partial ordering are considered as properties of the dimensions in
the XML representation. For example, as in Figure 9, putting “week” into “day”
as a property transforms the relation from a partial to a total order. The
Time dimension then has four levels: {Year, Quarter, Month, Day}, and the Day
level has a property “Day_of_Week”. The XML fragment could then be written as
follows:

<Dimension name="Time">
  <Description>Time dimension for Cube</Description>
  <Levels number="4">
    <Level name="Year" Title="Year"/>
    <Level name="Quarter" Title="Quarter"/>
    <Level name="Month" Title="Month"/>
    <Level name="Day" Title="Day" type="base">
      <Property name="Day_of_Week" datatype="String"/>
    </Level>
  </Levels>
  <Unit name="1997">
    <Unit name="Q1">
      <Unit name="January">
        <Unit name="1" baseID="1"><Property>Wednesday</Property></Unit>
        <Unit name="2" baseID="2"><Property>Thursday</Property></Unit>
        <Unit name="3" baseID="3"><Property>Friday</Property></Unit>
      </Unit>
    </Unit>
  </Unit>
</Dimension>
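One way to read such a Dimension document is to walk the nested Unit elements and pair each step of the path with the level names declared under Levels. The sketch below does this with the standard `xml.etree.ElementTree` module; the helper name `base_units` and the trimmed sample document are assumptions, and the code presumes that the nesting depth of the base units equals the number of declared levels.

```python
import xml.etree.ElementTree as ET

# Trimmed sample in the shape of the Time dimension fragment above.
TIME_XML = """
<Dimension name="Time">
  <Levels number="4">
    <Level name="Year"/>
    <Level name="Quarter"/>
    <Level name="Month"/>
    <Level name="Day" type="base">
      <Property name="Day_of_Week" datatype="String"/>
    </Level>
  </Levels>
  <Unit name="1997">
    <Unit name="Q1">
      <Unit name="January">
        <Unit name="1" baseID="1"><Property>Wednesday</Property></Unit>
        <Unit name="2" baseID="2"><Property>Thursday</Property></Unit>
      </Unit>
    </Unit>
  </Unit>
</Dimension>
"""


def base_units(dimension):
    """Flatten the Unit tree into one record per base-level unit."""
    levels = [lv.get("name") for lv in dimension.find("Levels")]
    prop = dimension.find("Levels/Level[@type='base']/Property").get("name")
    rows = []

    def walk(unit, path):
        path = path + [unit.get("name")]
        if unit.get("baseID") is not None:
            row = dict(zip(levels, path))      # level name -> member name
            row[prop] = unit.findtext("Property")
            rows.append(row)
        for child in unit.findall("Unit"):
            walk(child, path)

    for top in dimension.findall("Unit"):
        walk(top, [])
    return rows


rows = base_units(ET.fromstring(TIME_XML))
```

Each record keeps the day-of-week as an ordinary attribute, so grouping by week remains possible even though "week" is no longer a level of the hierarchy.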



4.3 XML Multidimensional Query Language (XMDQL) 

To be able to interact with the VDW, a form of querying must be in place. We
propose an XML-based declarative query language, XMDQL, that allows the user to
express multidimensional queries on the VDW. The concept of a special
multidimensional query language was first proposed (though still not
finalized) by Pilot Software as an industry standard; they called their
language MDSQL. In OLAP terminology this type of query is equivalent to slicing
and dicing the data cube. DIVE-ON defines XMDQL as a query language, formatted
in XML, for querying cube data from the VDW; it also provides functionality
similar to that of MDX (Multidimensional Expressions). The result of executing
an XMDQL query could be a cell, a two-dimensional slice, or a multidimensional
sub-cube. To specify a cube, an XMDQL query must contain information about four
basic subjects: (1) the cube being queried, (2) the dimensions projected in the
result cube, (3) the slices in each dimension, and (4) the members from a
non-projected dimension on which data will be filtered for members from the
projected dimensions.
The basic form of the XMDQL is as follows:

<XMDQL>
  <SELECT>
    Project dimensions and slices
  </SELECT>
  <FROM>
    Which cube to query
  </FROM>
  <WHERE>
    Filtering constraints
  </WHERE>
</XMDQL>

For example, given a data cube with four dimensions, Location, Time, Product
and Customer, the following XMDQL query would retrieve the sales of office
products in the USA for each quarter in 1997. The Product and Time dimensions
are projected dimensions, each in one slice, while the Location dimension
is expressed in the WHERE clause as a filter.

<XMDQL>
  <SELECT>
    <Measure name="Store_Sales"/>
    <Axis dimension="Product">
      <Slice type="mono" title="${name}">
        <Path>Office.*</Path>
      </Slice>
    </Axis>
    <Axis dimension="Time">
      <Slice type="mono" title="Quarter ${name}">
        <Path>1997.*</Path>
      </Slice>
    </Axis>
  </SELECT>
  <FROM>
    <Cube name="AllElectronics"/>
  </FROM>
  <WHERE>
    <Condition dimension="Location">
      <Path>N_America.USA</Path>
    </Condition>
  </WHERE>
</XMDQL>

The Axis element represents a projected dimension, and each Slice element
represents one piece of a slice cut from this dimension. The Path element
indicates the conceptual level in the concept hierarchy from which to retrieve
the data, as well as the constituents of the data to retrieve. For instance,
the path “Office.*” selects all children of Office, such as computer, fax, and
copier.
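The following sketch shows one possible reading of these elements: it pulls the cube, measure, axes, and filters out of an XMDQL document and expands a dotted path against a concept hierarchy, treating `*` as "every child at this step". The helper names, the nested-dictionary hierarchy, and the member names are assumptions for illustration; the paper does not specify how paths are resolved internally.

```python
import xml.etree.ElementTree as ET

QUERY = """
<XMDQL>
  <SELECT>
    <Measure name="Store_Sales"/>
    <Axis dimension="Product">
      <Slice type="mono"><Path>Office.*</Path></Slice>
    </Axis>
  </SELECT>
  <FROM><Cube name="AllElectronics"/></FROM>
  <WHERE>
    <Condition dimension="Location"><Path>N_America.USA</Path></Condition>
  </WHERE>
</XMDQL>
"""

# Hypothetical Product hierarchy: member name -> children.
PRODUCT = {
    "Office": {"Computer": {}, "Fax": {}, "Copier": {}},
    "Home": {"TV": {}},
}


def expand_path(hierarchy, path):
    """Return the member paths selected by a dotted path; '*' means all children."""
    selected = [((), hierarchy)]
    for step in path.split("."):
        next_selected = []
        for prefix, node in selected:
            if step == "*":
                names = list(node)
            else:
                names = [step] if step in node else []
            next_selected += [(prefix + (n,), node[n]) for n in names]
        selected = next_selected
    return [".".join(prefix) for prefix, _ in selected]


def parse_query(text):
    """Extract the four basic subjects of an XMDQL query into a dictionary."""
    root = ET.fromstring(text)
    return {
        "cube": root.find("FROM/Cube").get("name"),
        "measure": root.find("SELECT/Measure").get("name"),
        "axes": {a.get("dimension"): [p.text.strip() for p in a.iter("Path")]
                 for a in root.findall("SELECT/Axis")},
        "filters": {c.get("dimension"): c.findtext("Path").strip()
                    for c in root.findall("WHERE/Condition")},
    }


parsed = parse_query(QUERY)
office_members = expand_path(PRODUCT, parsed["axes"]["Product"][0])
```

A path that names a member absent from the hierarchy simply expands to an empty list, which a caller could treat as an empty slice.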

4.4 Query Distribution and Execution 

A VCU query is first received by the DCC-Shell interface (through ORB/SOAP 
Server) and forwarded to the Query Distributor through the DCC. The Query 
Distributor analyzes the query and, without translating the XMDQL query,
broadcasts it to all the relevant data sources based on the resource allocation
table, which contains information pertaining to the sources’ content and their
communication protocols. Each data source then translates the incoming query
into an appropriate form. A specific wrapper transforms the XMDQL query from
the global schema to a set of queries in the local schema after pruning
attributes that are not applicable to the local data source. Each data source
executes the query (or set of queries) and forms an XML document that contains
the needed cube data. This information is relayed back to the VDW via the
server software installed on the distributed sites. All the returned
information is then merged by the VDW to form an N-dimensional data cube. For
our visualization purposes, the DCC then extracts the VCU-requested data cube
and sends it back to be rendered in the CAVE. The distribution information
could be represented in XML as follows:

<Distribution dimension="Store">
  <Component path="N_America.USA"
             mart="DataMart1">USA sales data</Component>
  <Component path="N_America.Canada"
             mart="DataMart1">Canada sales data</Component>
</Distribution>
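The routing and merging steps can be sketched as follows. This is an illustration under stated assumptions rather than the Query Distributor itself: the function names are invented, the second mart is renamed `DataMart2` so that routing has something to distinguish, a source is considered relevant when its path and the query's filter path agree on their shared prefix, and merging assumes a SUM measure.

```python
import xml.etree.ElementTree as ET

# Sample distribution table; "DataMart2" is a hypothetical second source.
DISTRIBUTION = """
<Distribution dimension="Store">
  <Component path="N_America.USA" mart="DataMart1">USA sales data</Component>
  <Component path="N_America.Canada" mart="DataMart2">Canada sales data</Component>
</Distribution>
"""


def relevant_marts(distribution_xml, dimension, filter_path):
    """Names of the data marts that may hold data matching the filter."""
    root = ET.fromstring(distribution_xml)
    components = root.findall("Component")
    if root.get("dimension") != dimension:
        # The filter is on another dimension, so every source is relevant.
        return sorted({c.get("mart") for c in components})
    wanted = filter_path.split(".")
    marts = set()
    for comp in components:
        have = comp.get("path").split(".")
        shared = min(len(wanted), len(have))
        if wanted[:shared] == have[:shared]:
            marts.add(comp.get("mart"))
    return sorted(marts)


def merge_cubes(parts):
    """Combine partial results cell by cell, assuming a SUM measure."""
    merged = {}
    for part in parts:
        for coords, value in part.items():
            merged[coords] = merged.get(coords, 0.0) + value
    return merged


usa_only = relevant_marts(DISTRIBUTION, "Store", "N_America.USA")
continent = relevant_marts(DISTRIBUTION, "Store", "N_America")
merged = merge_cubes([{("Fax", "Q1"): 4.0}, {("Fax", "Q1"): 6.0, ("Fax", "Q2"): 1.0}])
```

Matching on the shared prefix means that a coarse filter such as `N_America` reaches every source beneath it, while a specific one such as `N_America.USA` reaches only the sources on that branch.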



5 Conclusions and Future Directions 



In this paper we have presented a system prototype for visual data mining in an
immersed virtual environment. Since the very early days of computing science,
when the needed technologies were extremely limited, scientists have been
fascinated with virtual reality (VR). VR systems are capable of abstracting
complex problems or scenarios in an easy-to-understand way by exploiting
natural human skills, including the visual system. The CAVE theatre is a new
technology that maximizes the utilization of the human sensory system. What is
novel about our research is that with DIVE-ON we have focused this technology
in a new direction, namely remote visual data mining. As was presented in |S|,
DIVE-ON is indeed capable of creating an IVE that interactively allows the user
to explore and learn from a distributed set of data with little or no
instructional help. To properly evaluate the effectiveness of this technology,
we are currently formally assessing our system by comparing interaction
capabilities, discovery opportunities and user satisfaction across three
settings: an implementation on a common screen with OpenGL, DIVE-ON as
described in this paper, and an immersed virtual environment in a smaller
version of the CAVE called the Cavelet. The Cavelet is a set of three flat
back-projected screens, approximately 3x4 feet each, surrounding the three
sides of a desk. The user also uses head and hand-held trackers as in
Figure 121 but is sitting down.

A major problem with visualizing data in three dimensions is occlusion. We
have addressed some of these concerns by using spheres in data representation
(Section 3); however, it is computationally prohibitive to render a large
number of 3D spheres while maintaining an interactive frame rate. One research
issue we plan to investigate is the use of distortion views, also called
fisheye or detail-in-context views, in 3D graphics.

So far we have exploited the human visual system to convey information 
pertaining to data. In the near future we also plan to experiment with data 
sonification techniques to add audible cues to the IVE. After applying a data 
mining algorithm, if a potentially useful piece of information that has otherwise 
been unknown is revealed, an audible cue would be generated. The sound at- 
tribute would not only indicate the significance of the finding but also aid the 
user in locating its origin within the data space. 

DIVE-ON creates a virtual world that is inhabited by one type of geometric
object (spheres or cubes), each of which uses its colour and size to tell
something about what it represents. We would like to examine the possibility of
increasing the number of data mining measures presented by introducing more
than one type of geometric object. For example, a pyramid that points upwards
could be used to indicate the existence of a monotonic increase somewhere at a
lower level, which is particularly useful in market analysis studies. The use
of such objects could instantly identify several attributes relating to
ever-changing consumer trends and utilization forecasting. Shape can also be
used as a cue, like colour and size, to visualize yet another discretized
dimension. We are also investigating extensions to XMDQL based on DMQL in order
to include data mining constraints for association rules and classification.
These XMDQL extensions would be used for interactive classification as well as
interactive evaluation of association rules from within the immersed virtual
environment in the CAVE.